Ron Shamir - Academia.edu
Papers by Ron Shamir
bioRxiv (Cold Spring Harbor Laboratory), Jun 3, 2024
Polygenic risk scores (PRS) predict individuals’ genetic risk of developing complex diseases. They summarize the effect of many genetic variants discovered in genome-wide association studies (GWASs). However, to date, large GWASs exist primarily for the European population, and the quality of PRS prediction declines when applied to target sets of other ethnicities. A key step in using a PRS is imputation, which is the inference of untyped SNPs using a set of fully sequenced individuals, called the imputation panel. The SNP genotypes called by the imputation process depend on the ethnic composition of the imputation panel. Several studies have shown that imputing genotypes using a panel that contains individuals of the same ethnicity as the genotyped individuals improves imputation accuracy. However, until now, there has been no systematic investigation into the influence of the ethnic composition of imputation panels on the accuracy of PRS predictions when applied to ethnic groups t...
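As a rough illustration of the PRS described in the abstract above, a score in its simplest form is a weighted sum of effect-allele dosages, with weights taken from GWAS effect sizes. The sketch below is a minimal example under that assumption. The function name, SNP identifiers, and weights are hypothetical and are not taken from the paper.

```python
from typing import Mapping


def polygenic_risk_score(dosages: Mapping[str, float],
                         weights: Mapping[str, float]) -> float:
    """Weighted sum of effect-allele dosages over SNPs that have a weight.

    SNPs with a weight but no genotype are skipped here; in a real
    pipeline these are the untyped SNPs that imputation would fill in.
    """
    return sum(beta * dosages[snp]
               for snp, beta in weights.items()
               if snp in dosages)


# Hypothetical effect sizes and one individual's dosages (0, 1 or 2 copies).
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.125}
dosages = {"rs1": 2, "rs2": 1}  # rs3 is untyped for this individual

score = polygenic_risk_score(dosages, weights)
print(score)
```

Because untyped SNPs contribute nothing in this sketch, the genotypes assigned by imputation directly change the sum, which is why the composition of the imputation panel can matter for the final score.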
Background: Microbial communities usually harbor a mix of bacteria, archaea, phages, plasmids, and microeukaryotes. Phages, plasmids, and microeukaryotes, which are present in low abundance in microbial communities, have complex interactions with bacteria and play important roles in horizontal gene transfer and antibiotic resistance. However, due to the difficulty of identifying phages, plasmids, and microeukaryotes in microbial communities, our understanding of these minor classes lags behind that of bacteria and archaea. Recently, several classifiers have been developed to separate one or two minor classes from bacteria and archaea in metagenome assemblies, but none can classify all four classes simultaneously. Moreover, existing classifiers have low precision on minor classes. Results: We developed 4CAC, the first classifier able to identify phages, plasmids, microeukaryotes, and prokaryotes simultaneously from metagenome assemblies. 4CAC generates an i...
Clustering methods are often applied to electronic medical records (EMR) data for various objectives, including the discovery of previously unrecognized disease subtypes. The abundance and redundancy of information in EMR data raise the need to identify and rank the features that are most relevant for clustering. Here we propose FRIGATE, an ensemble feature ranking algorithm for clustering, which uses game-theoretic concepts. FRIGATE derives the importance of features from solving multiple clustering problems on subgroups of features. In every such problem, a Shapley-like framework is used to rank a selected set of features, and multiplicative weights are employed to reduce the randomness in their selection. It outperforms extant ensemble ranking algorithms, both in solution quality and in speed. FRIGATE can improve disease understanding by enabling better subtype discovery from EMR data.
Scientific Reports
We sought to divide COVID-19 patients into distinct phenotypical subgroups using echocardiography and clinical markers to elucidate the pathogenesis of the disease and its heterogeneous cardiac involvement. A total of 506 consecutive patients hospitalized with COVID-19 infection underwent complete evaluation, including echocardiography, at admission. A k-prototypes algorithm applied to patients' clinical and imaging data at admission partitioned the patients into four phenotypical clusters: clusters 0 and 1 were younger and healthier, clusters 2 and 3 were older with worse cardiac indexes, and clusters 1 and 3 had a stronger inflammatory response. The clusters manifested very distinct survival patterns (C-index for the Cox proportional hazard model 0.77), with survival best for cluster 0, intermediate for 1–2 and worst for 3. Interestingly, cluster 1 showed a harsher disease course than cluster 2 but with similar survival. Clusters obtained with echocardiography were more predictive of m...
The challenge of survival prediction is ubiquitous in industry and medicine. Few methods are available for survival prediction of time-varying data. Here we propose a novel method for this problem, using a random forest of survival trees for left-truncated and right-censored data. We demonstrated the advantage of our method on prediction of breast cancer and prostate gland cancer risk among healthy individuals by analyzing routine laboratory measurements, vital signs and age. We analyzed electronic medical records of 20,317 healthy individuals who underwent routine checkups and identified those who later developed cancer. In cross-validation, our method predicted future prostate and breast cancers six months before diagnosis with an area under the ROC curve of 0.62±0.05 and 0.6±0.03 respectively, outperforming standard random forest, Cox-regression model and a single survival tree. Our results suggest that computational analysis of data on healthy individuals can improve the detecti...
Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature, 2018
Network-based module discovery (NBMD) methods are central to analysis of omics data. Such algorithms receive a gene network and nodes’ activity scores as input and report sub-networks (modules) that are putatively biologically active. Although such methods have existed for almost two decades, only a handful of studies have attempted to compare the biological signals captured by different methods. Here, we systematically evaluated six popular NBMD methods on gene expression (GE) and GWAS data. Notably, we observed that GO terms enriched in modules detected by these methods on the real data were often also enriched after randomly permuting the input data. To tackle this bias, we designed a method that evaluates the empirical significance of GO terms reported as enriched in modules. We used the method to fashion five novel performance criteria for evaluating NBMD methods. Last, we developed a novel NBMD algorithm called DOMINO. In extensive testing on GE and GWAS data it outperformed the other si...
Background: Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids: usually small, circular double-stranded DNA molecules that may transfer across bacterial species and confer antibiotic resistance. These plasmids are generally less studied and understood than their bacterial hosts. Part of the reason for this is a lack of computational tools for the analysis of plasmids in metagenomic samples. Results: We developed SCAPP (Sequence Contents-Aware Plasmid Peeler), an algorithm and tool to assemble plasmid sequences from metagenomic sequencing. SCAPP builds on some key ideas from the Recycler algorithm while improving plasmid assemblies by integrating biological knowledge about plasmids. We compared the performance of SCAPP to Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome dataset that we generated. We also created pl...
Environmental Microbiology, 2019
Summary: Horizontal gene transfer via plasmids plays a pivotal role in microbial evolution. The forces that shape the functionality and distribution of plasmidomes in natural environments are insufficiently understood. Here, we present a comparative study of plasmidomes across adjacent microbial environments present in different individual rumen microbiomes. Our findings show that the rumen plasmidome displays enormous unknown functional potential currently unannotated in available databases. Nevertheless, this unknown functionality is conserved and shared with published rat gut plasmidome data. Moreover, the rumen plasmidome is highly diverse compared with the microbiome that hosts these plasmids, across both similar and different rumen habitats. Our analysis demonstrates that its structure is shaped more by stochasticity than selection. Nevertheless, the plasmidome is an active partner in its intricate relationship with the host microbiome with both interacting with and responding to their...
Bioinformatics, 2022
Motivation: Active module identification (AMI) is an essential step in many omics analyses. Such algorithms receive a gene network and a gene activity profile as input and report subnetworks that show significant over-representation of accrued activity signal (‘active modules’). Such modules can point out key molecular processes in the analyzed biological conditions. Results: We recently introduced a novel AMI algorithm called DOMINO and demonstrated that it detects active modules that capture biological signals with markedly improved rate of empirical validation. Here, we provide an online server that executes DOMINO, making it more accessible and user-friendly. To help the interpretation of solutions, the server provides GO enrichment analysis, module visualizations and accessible output formats for customized downstream analysis. It also enables running DOMINO with various gene identifiers of different organisms. Availability and implementation: The server is available at http://dom...
Motivation: Sequencing long reads presents novel challenges to mapping. One such challenge is low sequence similarity between the reads and the reference, due to high sequencing error and mutation rates. This occurs, e.g., in a cancer tumor, or due to differences between strains of viruses or bacteria. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. Results: We introduce parameterized syncmer schemes, a generalization of syncmers, and provide a theoretical analysis for multi-parameter schemes. By combining these schemes with downsampling or minimizers we can achieve any desired compression and window guarantee. We implemented the use of parameterized syncmer schemes in the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long read data from a variety of genomes, the syncmer-based algorithms, with scheme paramet...
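To make the sketching idea in the abstract above concrete, the following is a minimal sketch of syncmer-style k-mer selection. It assumes the common formulation in which a k-mer is selected when its smallest s-mer starts at one of a fixed set of allowed offsets. The lexicographic s-mer order, the leftmost tie-breaking rule, and all function names are assumptions made for illustration, and this is not the minimap2 or Winnowmap2 implementation.

```python
from typing import List, Set, Tuple


def min_smer_position(kmer: str, s: int) -> int:
    """Offset of the smallest s-mer in the k-mer (leftmost on ties)."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers))  # lexicographic order: an assumption


def is_selected(kmer: str, s: int, positions: Set[int]) -> bool:
    """A k-mer is kept if its minimal s-mer starts at an allowed offset."""
    return min_smer_position(kmer, s) in positions


def sketch(seq: str, k: int, s: int, positions: Set[int]) -> List[Tuple[int, str]]:
    """Return (start, k-mer) for every selected k-mer in the sequence."""
    return [(i, seq[i:i + k])
            for i in range(len(seq) - k + 1)
            if is_selected(seq[i:i + k], s, positions)]


# With k=5 and s=2, offsets {0, 3} are the first and last s-mer positions.
both_ends = sketch("ACGTAC", k=5, s=2, positions={0, 3})
first_only = sketch("ACGTAC", k=5, s=2, positions={0})
print(both_ends, first_only)
```

Selection here depends only on the k-mer itself, not on its neighbours in a window, and changing the set of allowed offsets changes how many k-mers are kept.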
The rapid, continuous growth of deep sequencing experiments requires development and improvement of many bioinformatics applications for analysis of large sequencing datasets, including k-mer counting and assembly. Several applications reduce RAM usage by binning sequences. Binning is done by employing minimizer schemes, which rely on a specific order of the minimizers. It has been demonstrated that the choice of the order has a major impact on the performance of the applications. Here we introduce a method for tailoring the order to the dataset. Our method repeatedly samples the dataset and modifies the order so as to flatten the k-mer load distribution across minimizers. We integrated our method into Gerbil, a state-of-the-art memory-efficient k-mer counter, and were able to reduce its memory footprint by 30%–50% for large k, with only a minor increase in runtime. Our tests also showed that the orders produced by our method gave superior results when transferred across dataset...
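The dependence of bin load on the minimizer order can be illustrated with a small sketch. It assigns each k-mer to the bin of its minimizer under a given order and counts k-mers per bin. The two orders shown (lexicographic and reverse lexicographic) and all function names are illustrative assumptions. This is not Gerbil's code and not the order-tailoring method from the abstract.

```python
from collections import Counter
from typing import Callable


def minimizer(kmer: str, m: int, rank: Callable[[str], object]) -> str:
    """Smallest m-mer of the k-mer under the order induced by rank."""
    mmers = [kmer[i:i + m] for i in range(len(kmer) - m + 1)]
    return min(mmers, key=rank)


def bin_loads(seq: str, k: int, m: int,
              rank: Callable[[str], object] = lambda x: x) -> Counter:
    """Number of k-mers assigned to each minimizer bin."""
    loads = Counter()
    for i in range(len(seq) - k + 1):
        loads[minimizer(seq[i:i + k], m, rank)] += 1
    return loads


def reverse_lex(mmer: str):
    """Rank that reverses lexicographic order for equal-length strings."""
    return tuple(-ord(c) for c in mmer)


seq = "ACGTACGT"
lex_loads = bin_loads(seq, k=4, m=2)                    # default: lexicographic
rev_loads = bin_loads(seq, k=4, m=2, rank=reverse_lex)  # alternative order
print(lex_loads, rev_loads)
```

On this toy sequence the largest bin holds 4 of the 5 k-mers under the lexicographic order and 3 under the reversed order, so the order alone changes how evenly the k-mers spread across bins.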
Abstract: Spatiotemporal gene expression patterns are governed to a large extent by the activity of enhancer elements, which engage in physical contacts with their target genes. Identification of enhancer-promoter (EP) links that are functional only in a specific subset of cell types is a key challenge in understanding gene regulation. We introduce CT-FOCS, a statistical inference method that uses linear mixed effect models to infer EP links that show marked activity only in a single or a small subset of cell types out of a large panel of probed cell types. Analyzing 808 samples from FANTOM5, covering 472 cell lines, primary cells, and tissues, CT-FOCS inferred such EP links more accurately than recent state-of-the-art methods. Furthermore, we show that strictly cell type-specific EP links are very uncommon in the human genome.
Scientific Reports
The COVID-19 pandemic has been spreading worldwide since December 2019, presenting an urgent threat to global health. Due to the limited understanding of disease progression and of the risk factors for the disease, it is a clinical challenge to predict which hospitalized patients will deteriorate. Moreover, several studies suggested that taking early measures for treating patients at risk of deterioration could prevent or lessen condition worsening and the need for mechanical ventilation. We developed a predictive model for early identification of patients at risk for clinical deterioration by retrospective analysis of electronic health records of COVID-19 inpatients at the two largest medical centers in Israel. Our model employs machine learning methods and uses routine clinical features such as vital signs, lab measurements, demographics, and background disease. Deterioration was defined as a high NEWS2 score adjusted to COVID-19. In the prediction of deterioration within the next...