A multi-modal graph-based semi-supervised pipeline for predicting cancer survival (original) (raw)
Related papers
Improved survival analysis by learning shared genomic information from pan-cancer data
Bioinformatics, 2020
Motivation Recent advances in deep learning have offered solutions to many biomedical tasks. However, there remains a challenge in applying deep learning to survival analysis using human cancer transcriptome data. As the number of genes, the input variables of survival model, is larger than the amount of available cancer patient samples, deep-learning models are prone to overfitting. To address the issue, we introduce a new deep-learning architecture called VAECox. VAECox uses transfer learning and fine tuning. Results We pre-trained a variational autoencoder on all RNA-seq data in 20 TCGA datasets and transferred the trained weights to our survival prediction model. Then we fine-tuned the transferred weights during training the survival model on each dataset. Results show that our model outperformed other previous models such as Cox Proportional Hazard with LASSO and ridge penalty and Cox-nnet on the 7 of 10 TCGA datasets in terms of C-index. The results signify that the transferre...
Integrating Biological Knowledge with Gene Expression Profiles for Survival Prediction of Cancer
Journal of Computational Biology, 2009
Due to the large variability in survival times between cancer patients and the plethora of genes on microarrays unrelated to outcome, building accurate prediction models that are easy to interpret remains a challenge. In this paper, we propose a general strategy for improving performance and interpretability of prediction models by integrating gene expression data with prior biological knowledge. First, we link gene identifiers in expression dataset with gene annotation databases such as Gene Ontology (GO). Then we construct "supergenes" for each gene category by summarizing information from genes related to outcome using a modified principal component analysis (PCA) method. Finally, instead of using genes as predictors, we use these supergenes representing information from each gene category as predictors to predict survival outcome. In addition to identifying gene categories associated with outcome, the proposed approach also carries out additional within-category selection to select important genes within each gene set. We show, using two real breast cancer microarray datasets, that the prediction models constructed based on gene sets (or pathway) information outperform the prediction models based on expression values of single genes, with improved prediction accuracy and interpretability. 265 266 CHEN AND WANG However, due to the large variability in survival times between cancer patients and the plethora of genes on the microarrays unrelated to outcome, building accurate prediction models that are easy to interpret remains a challenge. In this paper, we propose a general strategy for improving performance and interpretability of prediction models by integrating gene expression data with prior biological knowledge. For simplicity, we use the terms "gene categories," "pathways," and "gene sets" interchangeably, although they may not be strictly equivalent. The idea is simple: first, we link gene identifiers in expression dataset with gene annotation databases such as Gene Ontology (GO) . Then we construct "supergenes" for each pathway by summarizing information from genes related to outcome using a modified PCA method. Finally, instead of using genes as predictors, we use these supergenes representing information from each pathway as predictors to predict survival outcome.
Learning Individual Survival Models from PanCancer Whole Transcriptome Data
Clinical Cancer Research
Purpose: Personalized medicine attempts to predict survival time for each patient, based on their individual tumor molecular profile. We investigate whether our survival learner in combination with a dimension reduction method can produce useful survival estimates for a variety of patients with cancer. Patients and Methods: This article provides a method that learns a model for predicting the survival time for individual patients with cancer from the PanCancer Atlas: given the (16,335 dimensional) gene expression profiles from 10,173 patients, each having one of 33 cancers, this method uses unsupervised nonnegative matrix factorization (NMF) to reexpress the gene expression data for each patient in terms of 100 learned NMF factors. It then feeds these 100 factors into the Multi-Task Logistic Regression (MTLR) learner to produce cancer-specific models for each of 20 cancers (with >50 uncensored instances); this produces “individual survival distributions” (ISD), which provide surv...
PLoS Computational Biology, 2012
Predicting the clinical outcome of cancer patients based on the expression of marker genes in their tumors has received increasing interest in the past decade. Accurate predictors of outcome and response to therapy could be used to personalize and thereby improve therapy. However, state of the art methods used so far often found marker genes with limited prediction accuracy, limited reproducibility, and unclear biological relevance. To address this problem, we developed a novel computational approach to identify genes prognostic for outcome that couples gene expression measurements from primary tumor samples with a network of known relationships between the genes. Our approach ranks genes according to their prognostic relevance using both expression and network information in a manner similar to Google's PageRank. We applied this method to gene expression profiles which we obtained from 30 patients with pancreatic cancer, and identified seven candidate marker genes prognostic for outcome. Compared to genes found with state of the art methods, such as Pearson correlation of gene expression with survival time, we improve the prediction accuracy by up to 7%. Accuracies were assessed using support vector machine classifiers and Monte Carlo cross-validation. We then validated the prognostic value of our seven candidate markers using immunohistochemistry on an independent set of 412 pancreatic cancer samples. Notably, signatures derived from our candidate markers were independently predictive of outcome and superior to established clinical prognostic factors such as grade, tumor size, and nodal status. As the amount of genomic data of individual tumors grows rapidly, our algorithm meets the need for powerful computational approaches that are key to exploit these data for personalized cancer therapies in clinical practice.
Journal of Automation and Control Engineering, 2015
Abstract-A successful classification of different tumor types is essential for successful treatment of cancer. However, most prior cancer classification methods are clinical-based and have inadequate diagnostic ability. Cancer classification using gene expression data is very important in cancer diagnosis and drug discovery. The introduction of DNA microarray techniques has made simultaneous monitoring of thousands of gene expression probable. With this abundance of gene expression data nowadays, the researchers have the opportunity to do cancer classification using gene expression data. In recent years, a lot of machine learning methods have been proposed to do cancer classification using gene expression data such as clustering-based methods, k-nearest neighbor method, artificial neural network method, and support vector machine method, to name a few. In this paper, we present the un-normalized graph p-Laplacian semisupervised learning methods. These methods will be applied to the patient-patient network constructed from the gene expression data to predict the tumor types of all patients in the network. These methods are based on the assumption that the labels of two adjacent patients in the network are likely to be the same. The experiments show that that the un-normalized graph p-Laplacian semi-supervised learning methods are at least as good as the current state of the art network-based method (the un-normalized graph Laplacian based semi-supervised learning method) but often lead to better classification accuracy performance measures.
Deep learning-based cancer survival prognosis from RNA-seq data: approaches and evaluations
BMC Medical Genomics, 2020
Background Recent advances in kernel-based Deep Learning models have introduced a new era in medical research. Originally designed for pattern recognition and image processing, Deep Learning models are now applied to survival prognosis of cancer patients. Specifically, Deep Learning versions of the Cox proportional hazards models are trained with transcriptomic data to predict survival outcomes in cancer patients. Methods In this study, a broad analysis was performed on TCGA cancers using a variety of Deep Learning-based models, including Cox-nnet, DeepSurv, and a method proposed by our group named AECOX (AutoEncoder with Cox regression network). Concordance index and p-value of the log-rank test are used to evaluate the model performances. Results All models show competitive results across 12 cancer types. The last hidden layers of the Deep Learning approaches are lower dimensional representations of the input data that can be used for feature reduction and visualization. Furthermo...
Oncotarget
Accurate prediction of prognosis is critical for therapeutic decisions regarding cancer patients. Many previously developed prognostic scoring systems have limitations in reflecting recent progress in the field of cancer biology such as microarray, next-generation sequencing, and signaling pathways. To develop a new prognostic scoring system for cancer patients, we used mRNA expression and clinical data in various independent breast cancer cohorts (n=1214) from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) and Gene Expression Omnibus (GEO). A new prognostic score that reflects gene network inherent in genomic big data was calculated using Network-Regularized highdimensional Cox-regression (Net-score). We compared its discriminatory power with those of two previously used statistical methods: stepwise variable selection via univariate Cox regression (Uni-score) and Cox regression via Elastic net (Enet-score). The Net scoring system showed better discriminatory power in prediction of diseasespecific survival (DSS) than other statistical methods (p=0 in METABRIC training cohort, p=0.000331, 4.58e-06 in two METABRIC validation cohorts) when accuracy was examined by log-rank test. Notably, comparison of C-index and AUC values in receiver operating characteristic analysis at 5 years showed fewer differences between training and validation cohorts with the Net scoring system than other statistical methods, suggesting minimal overfitting. The Net-based scoring system also successfully predicted prognosis in various independent GEO cohorts with high discriminatory power. In conclusion, the Net-based scoring system showed better discriminative power than previous statistical methods in prognostic prediction for breast cancer patients. This new system will mark a new era in prognosis prediction for cancer patients.
A semi-supervised method for predicting cancer survival using incomplete clinical data
2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2015
Prediction of survival for cancer patients is an open area of research. However, many of these studies focus on datasets with a large number of patients. We present a novel method that is specifically designed to address the challenge of data scarcity, which is often the case for cancer datasets. Our method is able to use unlabeled data to improve classification by adopting a semi-supervised training approach to learn an ensemble classifier. The results of applying our method to three cancer datasets show the promise of semi-supervised learning for prediction of cancer survival.
Iranian journal of public health, 2017
Today, despite the many advances in early detection of diseases, cancer patients have a poor prognosis and the survival rates in them are low. Recently, microarray technologies have been used for gathering thousands data about the gene expression level of cancer cells. These types of data are the main indicators in survival prediction of cancer. This study highlights the improvement of survival prediction based on gene expression data by using machine learning techniques in cancer patients. This review article was conducted by searching articles between 2000 to 2016 in scientific databases and e-Journals. We used keywords such as machine learning, gene expression data, survival and cancer. Studies have shown the high accuracy and effectiveness of gene expression data in comparison with clinical data in survival prediction. Because of bewildering and high volume of such data, studies have highlighted the importance of machine learning algorithms such as Artificial Neural Networks (AN...
Network modeling of patients' biomolecular profiles for clinical phenotype/outcome prediction
Scientific Reports
Methods for phenotype and outcome prediction are largely based on inductive supervised models that use selected biomarkers to make predictions, without explicitly considering the functional relationships between individuals. We introduce a novel network-based approach named Patient-Net (P-Net) in which biomolecular profiles of patients are modeled in a graph-structured space that represents gene expression relationships between patients. Then a kernel-based semi-supervised transductive algorithm is applied to the graph to explore the overall topology of the graph and to predict the phenotype/clinical outcome of patients. Experimental tests involving several publicly available datasets of patients afflicted with pancreatic, breast, colon and colorectal cancer show that our proposed method is competitive with state-of-the-art supervised and semi-supervised predictive systems. Importantly, P-Net also provides interpretable models that can be easily visualized to gain clues about the re...