Giuseppe Jurman | Bruno Kessler Foundation
Papers by Giuseppe Jurman
CovMulNet19 is a comprehensive network containing all available known interactions involving SARS-CoV-2 proteins, interacting human proteins, diseases and symptoms that are related to these human proteins, and compounds that can potentially target them. Extensive network analysis methods, based on a bootstrap approach, allow us to prioritise a list of diseases that display a high similarity to COVID-19 and a list of drugs that could potentially be beneficial to treat patients. As a key feature of CovMulNet19, the inclusion of symptoms allows a deeper characterization of the disease pathology, representing a useful proxy for COVID-19-related molecular processes. We recapitulate many of the known symptoms of the disease and we find that the diseases most similar to COVID-19 reflect conditions that are risk factors in patients.
ArXiv, 2015
Different strategies have been considered to extract information from social media about how similarly people react to the same news or event. In this context, a powerful method is offered by the application of graph techniques to the contents produced by social network users. In particular, large events typically attract enough content traffic over time to enable an analysis that explicitly models a dependence on the time dimension. Here we demonstrate how it is possible to extend the application of community detection strategies in complex networks to the case of time-dependent multilayer networks, whenever the connection between consecutive time layers is non-trivial. We apply the method to 400K Twitter posts related to the Expo event held in Milan (Italy) between May and October 2015.
TAASRAD19 (Trentino-Alto Adige/Südtirol Radar 2019) is a high-resolution radar reflectivity dataset collected by the Civil Protection weather radar of the Trentino South Tyrol Region, in the Italian Alps. The dataset includes 894,916 scans of precipitation from more than 9 years of data, offering a novel resource to develop and benchmark analog ensemble models and machine learning solutions for precipitation nowcasting. Data are expressed as 2D images, considering the maximum reflectivity on the vertical section and a 5-minute sampling rate, covering an area 240 km in diameter at 500 m horizontal resolution. The TAASRAD19 distribution also includes a curated set of 1,732 sequences, for a total of 362,233 radar images, labeled with precipitation type tags assigned by expert meteorologists. We validated TAASRAD19 as a benchmark for nowcasting using a deep learning model to forecast reflectivity and a procedure based on the UMAP dimensionality reduction method for interactive exploration. Software methods for data pre-processing, model training and inference, and a pre-trained model are publicly available at https://github.com/MPBA/TAASRAD19 for replication and reproducibility.
Genome Biology, 2021
Background: Oncopanel genomic testing, which identifies important somatic variants, is increasingly common in medical practice and especially in clinical trials. Currently, there is a paucity of reliable genomic reference samples having a suitably large number of pre-identified variants for properly assessing oncopanel assay analytical quality and performance. The FDA-led Sequencing and Quality Control Phase 2 (SEQC2) consortium analyzes ten diverse cancer cell lines individually and their pool, termed Sample A, to develop a reference sample with suitably large numbers of coding positions with known (variant) positives and negatives for properly evaluating oncopanel analytical performance. Results: In reference Sample A, we identify more than 40,000 variants down to 1% allele frequency, with more than 25,000 variants having less than 20% allele frequency and 1653 variants in COSMIC-related genes. This is 5–100× more than existing commercially available samples. We also identify an unpr...
IEEE Transactions on Network Science and Engineering, 2020
Journal of Environmental Science and Health, Part C, 2018
Springer Handbook of Bio-/Neuroinformatics, 2014
Identifying the molecular pathways more prone to disruption during a pathological process is a key task in network medicine and, more generally, in systems biology. This chapter describes a pipeline that couples a machine learning solution for molecular profiling with a recent network comparison method. The pipeline can identify changes occurring between specific sub-modules of networks built in a case-control biomarker study, discriminating key groups of genes whose interactions are modified by an underlying condition. Different algorithms can be chosen to implement the workflow steps. Three applications on genome-wide data are presented regarding the susceptibility of children to air pollution, and early and late onset of Parkinson's and Alzheimer's diseases.
Nature Biotechnology, 2014
The concordance of RNA-sequencing (RNA-seq) with microarrays for genome-wide analysis of differential gene expression has not been rigorously assessed using a range of chemical treatment conditions. Here we use a comprehensive study design to generate Illumina RNA-seq and Affymetrix microarray data from the same liver samples of rats exposed in triplicate to varying degrees of perturbation by 27 chemicals representing multiple modes of action (MOAs). The cross-platform concordance in terms of differentially expressed genes (DEGs) or enriched pathways is linearly correlated with treatment effect size (R² ≈ 0.8). Furthermore, the concordance is also affected by transcript abundance and biological complexity of the MOA. RNA-seq outperforms microarray (93% versus 75%) in DEG verification as assessed by quantitative PCR, with the gain mainly due to its improved accuracy for low-abundance transcripts. Nonetheless, classifiers to predict MOAs perform similarly when developed using data fro...
PLoS ONE, 2012
The traditional staging system is inadequate to identify those patients with stage II colorectal cancer (CRC) at high risk of recurrence or with stage III CRC at low risk. A number of gene expression signatures to predict CRC prognosis have been proposed, but none is routinely used in the clinic. The aim of this work was to assess the prediction ability and potential clinical usefulness of these signatures in a series of independent datasets. A literature review identified 31 gene expression signatures that used gene expression data to predict prognosis in CRC tissue. The search was based on the PubMed database and was restricted to papers published from January 2004 to December 2011. Eleven CRC gene expression datasets with outcome information were identified and downloaded from public repositories. A Random Forest classifier was used to build predictors from the gene lists. The Matthews correlation coefficient was chosen as a measure of classification accuracy and its associated p-value...
Nature Biotechnology, 2010
International Journal of Cancer, 2006
We analyzed the expression signatures of 14 tumor biopsies from children affected by alveolar rhabdomyosarcoma (ARMS) to identify genes correlating to biological features of this tumor. Seven of these patients were positive for the PAX3-FKHR fusion gene and 7 were negative. We used a cDNA platform containing a large majority of probes derived from muscle tissues. The comparison of transcription profiles of tumor samples with fetal skeletal muscle identified 171 differentially expressed genes common to all ARMS patients. The functional classification analysis of altered genes led to the identification of a group of transcripts (LGALS1, BIN1) that may be relevant for the tumorigenic processes. The muscle-specific microarray platform was able to distinguish PAX3-FKHR positive and negative ARMS through the expression pattern of a limited number of genes (RAC1, CFL1, CCND1, IGFBP2) that might be biologically relevant for the different clinical behavior and aggressiveness of the 2 ARMS subtypes. Expression levels for selected candidate genes were validated by quantitative real-time reverse-transcription PCR.
Briefings in Bioinformatics, 2007
Bioinformatics, 2007
Motivation: We propose a method for studying the stability of biomarker lists obtained from functional genomics studies. It is common to adopt resampling methods to tune and evaluate marker-based diagnostic and prognostic systems in order to prevent selection bias. Such caution promotes honest estimation of class prediction, but leads to alternative sets of solutions. In microarray studies, the difference in lists may be bewildering, also due to the presence of modules of functionally related genes. Methods for assessing stability help in understanding the dependency of the markers on the data or on the predictor's type and in selecting solutions. Results: A computational framework for comparing sets of ranked biomarker lists is presented. Notions and algorithms are based on concepts from permutation group theory. We introduce several algebraic indicators and metric methods for symmetric groups, including the Canberra distance, a weighted version of Spearman's footrule. We also consider distances between partial lists and an aggregation of sets of lists into an optimal list based on voting theory (Borda count). The stability indicators are applied in practical situations to several synthetic, cancer microarray and proteomics datasets. The addressed issues are predictive classification, presence of modules, comparison of alternative biomarker lists, outlier removal, control of selection bias by randomization techniques and enrichment analysis.
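As an illustration of the two ideas named in the abstract above, the following is a minimal sketch, not the authors' implementation. It assumes complete lists over the same items, ranks starting at 1, and the textbook form of the Canberra distance, sum of |a - b| / (a + b); the function names `canberra_distance` and `borda_aggregate` are hypothetical, and the paper's treatment of partial lists is not covered.

```python
def canberra_distance(ranks_a, ranks_b):
    """Canberra distance between two rank vectors.

    ranks_a[i] and ranks_b[i] are the positions (starting at 1) that two
    ranked lists assign to the same item i. Disagreements near the top of
    the lists weigh more than disagreements near the bottom, because the
    denominator a + b is smaller there.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rank vectors must have the same length")
    return sum(abs(a - b) / (a + b) for a, b in zip(ranks_a, ranks_b))


def borda_aggregate(ranked_lists):
    """Aggregate complete ranked lists with a Borda count.

    Each list orders the same items, best first. An item in position p
    (0-based) of a list of length n receives n - p points; items are
    returned by decreasing total, with ties broken alphabetically
    (an assumption made here only to keep the output deterministic).
    """
    scores = {}
    for ranking in ranked_lists:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - position)
    return sorted(scores, key=lambda item: (-scores[item], item))


if __name__ == "__main__":
    # Two lists that swap their top two items out of three.
    print(canberra_distance([1, 2, 3], [2, 1, 3]))
    # Three toy biomarker lists (gene names are placeholders).
    print(borda_aggregate([["g1", "g2", "g3"],
                           ["g1", "g3", "g2"],
                           ["g2", "g1", "g3"]]))
```

A lower average pairwise distance across resampled lists would indicate a more stable biomarker list under these assumptions.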
Background: Autism Spectrum Disorder (ASD) symptoms are heterogeneous and hard to discriminate in distinct subtypes. Although candidate loci have been recently identified by integration of large ASD cohorts (Wang et al. 2009), new bioinformatics methods are needed to cope with high individual variability. The l1-l2 regularization is a feature selection technique capable of generating a specific signature in biologically complex settings. It was applied to detect markers of transcriptional response of neuroblastoma to hypoxia (Fardin et al. 2009), and proposed for predicting quantitative phenotype traits from high dimensional genetic data (Guzzetta et al. 2009). Here we studied its first large scale application to whole genome association data from the AGRE research program. Objectives: We aim to predict Social Responsiveness Scale (SRS) levels by means of a new bioinformatics platform for quantitative phenotype prediction. Although currently there is a limited coverage of the SRS phenotypes in the AGRE cohort, this is a powerful set of indicators that can be used to determine individual trajectories, the ultimate goal for our analysis. Here we set a bioinformatics experiment in which all unfiltered variant positions in the genome are used as potential markers and training is based on extreme value cases. Methods: Given the 2,883 AGRE samples genotyped by the Broad Institute with the Affymetrix 5.0 platform (399,197 SNPs), we first identified 803 individuals with only ADI or ADOS-confirmed autism diagnosis and 1446 healthy controls not tested for ADI. Individuals having a teacher-administered SRS questionnaire were then selected, leaving 144 cases and 19 controls. We considered the highest 17 and lowest 18 SRS total scores (respectively, only cases and only controls).
A linear l1-l2 regularization regression model was trained on all features, using the SRS total score as target. The experiment protocol was based on the 10x5 procedure of the FDA's MAQC-II (5-fold cross-validation repeated 10 times). For the l1-l2 parameter set having the best average R^2 score computed from CV test portions, we evaluated the Area under the Curve (AUC) for classification from real predictions (Wilcoxon Mann-Whitney) and ranked the weights corresponding to each selected SNP. Results: AUC was 0.723 (95% CI: 0.684-0.768), with a fit of R^2 = 0.237 (95% CI: 0.155-0.331). The same 51,744 SNPs were consistently selected in all experiments. Ranked by regression weights, the top 30 markers all had an average position higher than 150. Of these, 24 belong to only four regions: 3p12.2 (3), 8q21.11 (10), 11p12 (3), 11p14.1 (5), Xp11.4 (3). Nearby loci on chromosomes 8 and 11 had been previously identified for SRS by Duvall et al. (2007). Within the top 500 SNPs, we also found 12 SNPs at loci 5p14.1 (2), 14q21.1 (4) and Xq21.1 (6) indicated as candidate markers for autism (Wang et al. 2009). Conclusions: This study represents the first application of a regression method to an autism-related quantitative phenotype. When trained on extreme values of the SRS score, the l1-l2 method fairly discriminated cases from controls and explained 23.7% of variance. Moreover, selected markers were stable and consistent with literature. Top ranked markers are being investigated.
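The resampling schedule described in the Methods (5-fold cross-validation repeated 10 times) can be sketched as follows. This is only an illustration of how the 50 train/test splits might be generated, assuming plain shuffling without stratification; the function name `repeated_kfold`, the seed, and the fold assignment are assumptions, and the regression model and the MAQC-II specifics are not reproduced.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    Each repeat reshuffles the sample indices and deals them into
    n_folds disjoint test sets, so every sample is tested exactly once
    per repeat. The seed is fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_folds):
            # Every n_folds-th shuffled index goes to this fold's test set.
            test = sorted(indices[fold::n_folds])
            held_out = set(test)
            train = [i for i in range(n_samples) if i not in held_out]
            yield repeat, fold, train, test


if __name__ == "__main__":
    # 35 samples, matching the 17 + 18 extreme-score individuals.
    splits = list(repeated_kfold(35))
    print(len(splits))  # 10 repeats x 5 folds
```

In such a protocol a score like R^2 would be computed on each test portion and then averaged over all splits for every candidate parameter set.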
IEEE Access, 2021
Even if measuring the outcome of binary classifications is a pivotal task in machine learning and statistics, no consensus has been reached yet about which statistical rate to employ to this end. In the last century, the computer science and statistics communities have introduced several scores summing up the correctness of the predictions with respect to the ground truth values. Among these scores, the Matthews correlation coefficient (MCC) was shown to have several advantages over confusion entropy, accuracy, F1 score, balanced accuracy, bookmaker informedness, markedness, and diagnostic odds ratio: MCC, in fact, produces a high score only if the majority of the predicted negative data instances and the majority of the positive data instances are correct, and therefore it is very trustworthy on imbalanced datasets. In this study, we compare MCC with two other popular scores: Cohen's Kappa, a metric that originated in social sciences, and the Brier score, a strictly proper scoring function which emerged in weather forecasting studies. After explaining the mathematical properties and the relationships between MCC and each of these two rates, we report some use cases where these scores generate different values, which lead to discordant outcomes, where MCC provides a more truthful and informative result. We highlight the reasons why it is more advisable to use MCC rather than Cohen's Kappa and the Brier score to evaluate binary classifications.
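For readers who want to compare the three scores on their own confusion matrices, here is a minimal sketch using the standard textbook definitions. It is not taken from the paper; the function names are hypothetical, and returning 0.0 when a denominator vanishes is a convention assumed here, not a claim about how the authors handle that case.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from a 2x2 confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for a degenerate matrix
    return (tp * tn - fp * fn) / denominator


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1:
        return 0.0  # assumed convention when chance agreement is total
    return (observed - expected) / (1 - expected)


def brier_score(probabilities, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


if __name__ == "__main__":
    # Balanced toy matrix: 3 TP, 1 FP, 1 FN, 3 TN.
    print(mcc(3, 1, 1, 3), cohen_kappa(3, 1, 1, 3))
    # Toy probabilistic predictions against binary ground truth.
    print(brier_score([0.9, 0.2, 0.6], [1, 0, 0]))
```

Note that MCC and Cohen's Kappa take values in [-1, 1] with higher being better, whereas the Brier score lies in [0, 1] with lower being better, so the three cannot be compared on raw values alone.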