New Risk Modeling Method for Robust Learning on Smaller Samples
Related papers
Ensemble Risk Modeling Method for Robust Learning on Scarce Data
In medical risk modeling, typical data are "scarce": they have a relatively small number of training instances (N), censoring, and high dimensionality (M). We show that the problem can be effectively simplified by reducing it to bipartite ranking, and introduce a new bipartite ranking algorithm, Smooth Rank, for robust learning on scarce data. The algorithm is based on ensemble learning with unsupervised aggregation of predictors. The advantage of our approach is confirmed in a comparison with two "gold standard" risk modeling methods on 10 real-life survival analysis datasets, where the new approach achieves the best results on all but the two datasets with the largest ratio N/M. For a systematic study of the effects of data scarcity on modeling by all three methods, we conducted two types of computational experiments: on real-life data with randomly drawn training sets of different sizes, and on artificial data with an increasing number of features. Both experiments demonstrated that Smooth Rank has a critical advantage over the popular methods on scarce data: it does not suffer from overfitting where the other methods do.
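The abstract does not spell out the reduction, but the idea of recasting censored survival data as bipartite ranking can be illustrated with a minimal sketch: fix a time horizon, keep only the cases whose status at that horizon is determined, and label them high or low risk. The function name, field names, and the 12-month horizon below are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def survival_to_bipartite(time, event, horizon):
    """Reduce censored survival data to a bipartite ranking problem.

    A case is a confirmed 'positive' (high risk) if the event occurred
    before the horizon, and a confirmed 'negative' if it was observed
    (event or censoring) beyond the horizon.  Cases censored before the
    horizon have an undetermined label and are dropped.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    positive = event & (time <= horizon)   # event observed before horizon
    negative = time > horizon              # known to survive past horizon
    keep = positive | negative             # label is determined
    labels = positive[keep].astype(int)    # 1 = high risk, 0 = low risk
    return keep, labels

# Toy usage: times in months, event=1 means the event was observed.
keep, y = survival_to_bipartite(
    time=[3, 15, 7, 24, 5], event=[1, 0, 0, 1, 1], horizon=12
)
print(keep, y)   # the case censored at month 7 (< 12) is dropped
```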
Bipartite ranking algorithm for classification and survival analysis
Unsupervised aggregation of independently built univariate predictors is explored as an alternative regularization approach for noisy, sparse datasets. The bipartite ranking algorithm Smooth Rank, which implements this approach, is introduced. The advantages of this algorithm are demonstrated on two types of problems. First, Smooth Rank is applied to two-class problems from the biomedical field, where ranking is often preferable to classification. In a comparison against SVMs with radial and linear kernels, Smooth Rank had the best performance on 8 out of 12 benchmark datasets. The second area of application is survival analysis, which is reduced here to bipartite ranking in a way that allows one to use commonly accepted measures of method performance. In a comparison of Smooth Rank with Cox PH regression and the CoxPath method, Smooth Rank proved to be the best on 9 out of 10 benchmark datasets.
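As a rough illustration of unsupervised aggregation (not the published Smooth Rank procedure), one can build one univariate ranker per feature, orient each by its training-set correlation with the label, and then average rank-transformed test scores; the aggregation step itself uses no labels. All names below are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def aggregate_univariate_rankers(X_train, y_train, X_test):
    """Ensemble score from independently built univariate predictors.

    Each feature acts as its own ranker, oriented by the sign of its
    training-set correlation with the label; test-set ranks are then
    averaged, an aggregation that itself uses no labels.
    """
    scores = []
    for j in range(X_train.shape[1]):
        r = np.corrcoef(X_train[:, j], y_train)[0, 1]
        sign = 1.0 if r >= 0 else -1.0      # orient toward the positive class
        scores.append(rankdata(sign * X_test[:, j]))
    return np.mean(scores, axis=0)          # unsupervised average of ranks
```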
Learning transformation models for ranking and survival analysis
2011
This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP. The key insight is to relate ranking criteria such as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control when learning transformation models, much as the 'margin' is for Support Vector Machines in classification. The use of this model structure in the context of high-dimensional data, as well as for estimating non-linear, additive and sparse models based on primal-dual kernel machines, is indicated. Given n observations, the present method solves a quadratic program consisting of O(n) constraints and O(n) unknowns, whereas most existing risk minimization approaches to ranking problems typically result in algorithms with O(n²) constraints or unknowns. We specify the MINLIP method for three different cases: the first concerns the preference learning problem. Secondly, it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis, where one models failure times typically subject to censoring. The current approach is found to be particularly useful in this context, as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of its explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model of cancer survival based on different micro-array and clinical data sets.
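A hedged reconstruction of the core program for the linear, uncensored case (the paper's exact notation may differ): sort the n observations so that $y_{(1)} \le \dots \le y_{(n)}$ and solve

$$\min_{w}\; w^\top w \quad \text{s.t.}\quad w^\top\!\left(x_{(i)} - x_{(i-1)}\right) \;\ge\; y_{(i)} - y_{(i-1)}, \qquad i = 2, \dots, n.$$

Minimizing $w^\top w$ controls the Lipschitz constant of the learned transformation, and imposing constraints only between consecutive pairs in the sorted order is what yields O(n) rather than O(n²) constraints.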
On Ranking in Survival Analysis: Bounds on the Concordance Index
In this paper, we show that classical survival analysis involving censored data can naturally be cast as a ranking problem. The concordance index (CI), which quantifies the quality of rankings, is the standard performance measure for model assessment in survival analysis. In contrast, the standard approach to learning the popular proportional hazard (PH) model is based on Cox's partial likelihood. We devise two bounds on the CI, one of which emerges directly from the properties of PH models, and optimize them directly. Our experimental results suggest that all three methods perform about equally well, with our new approach giving slightly better results. We also explain why a method designed to maximize Cox's partial likelihood also ends up (approximately) maximizing the CI.
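For reference, the concordance index the paper bounds can be computed directly: a pair is comparable when the earlier of the two times is an observed event. This is a generic implementation of Harrell's C, not the authors' bound-optimization code.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when the earlier time is an observed
    event; it is concordant when the case with the shorter survival
    time received the higher predicted risk.  Ties in risk count 0.5.
    """
    time, event, risk = map(np.asarray, (time, event, risk))
    num, den = 0.0, 0.0
    for i in range(len(time)):
        if not event[i]:
            continue                      # i must be an observed event
        for j in range(len(time)):
            if time[j] > time[i]:         # j outlived i: comparable pair
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

# Perfectly anti-ordered risks give C = 1.0 on this toy example.
print(concordance_index([2, 4, 6], [1, 1, 0], [0.9, 0.5, 0.1]))
```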
A model-free machine learning method for risk classification and survival probability prediction
Stat, 2014
Risk classification and survival probability prediction are two major goals in survival data analysis, since they play an important role in patients' risk stratification, long-term diagnosis, and treatment selection. In this article, we propose a new model-free machine learning framework for risk classification and survival probability prediction based on weighted support vector machines. The new procedure does not require any specific parametric or semiparametric model assumption on the data, and is therefore capable of capturing nonlinear covariate effects. We use numerous simulation examples to demonstrate the finite-sample performance of the proposed method under various settings. Applications to glioma tumor data and breast cancer gene expression survival data illustrate the new methodology in real data analysis.
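The abstract does not give the weighting scheme, but the mechanical core, an SVM trained with per-sample weights, can be sketched with scikit-learn; the synthetic data, weights, and kernel choice below are placeholders, not the authors' construction.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical setup: X are covariates, y is a binary risk label at some
# horizon, and w are per-sample weights (e.g. weights that compensate for
# censoring); the paper's exact weighting scheme is not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=200) > 0).astype(int)
w = rng.uniform(0.5, 1.5, size=200)

clf = SVC(kernel="rbf")          # nonlinear kernel captures nonlinear effects
clf.fit(X, y, sample_weight=w)   # weights scale each sample's hinge-loss term
print(clf.score(X, y))
```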
Adaptive Estimation of the Optimal ROC Curve and a Bipartite Ranking Algorithm
Lecture Notes in Computer Science, 2009
In this paper, we propose an adaptive algorithm for bipartite ranking and prove its statistical performance in a stronger sense than the AUC criterion. Our procedure builds on the RankOver algorithm proposed in (Clémençon & Vayatis, 2008a). The algorithm outputs a piecewise constant scoring rule which is obtained by overlaying a finite collection of classifiers. Here, each of these classifiers is the empirical solution of a specific minimum-volume set (MV-set) estimation problem. The main novelty arises from the fact that the levels of the MV-sets to recover are chosen adaptively from the data to adjust to the variability of the target curve. The ROC curve of the estimated scoring rule may be interpreted as an adaptive spline approximant of the optimal ROC curve. Error bounds for the estimate of the optimal ROC curve in terms of the L∞-distance are also provided.
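The MV-set estimation machinery is beyond a short snippet, but the object the spline approximant targets, the empirical ROC curve of a piecewise-constant scoring rule, is easy to display; the synthetic labels and scores below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Not the RankOver/MV-set procedure itself: just the empirical ROC of a
# piecewise-constant scoring rule s(x), the curve the paper approximates.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
scores = np.digitize(rng.normal(loc=y, size=500), bins=[-0.5, 0.5, 1.5])

fpr, tpr, thresholds = roc_curve(y, scores)
print(np.c_[fpr, tpr])   # vertices of the piecewise-linear empirical ROC
```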
Stabilized sparse ordinal regression for medical risk stratification
Knowledge and Information Systems, 2014
The recent wide adoption of Electronic Medical Records (EMR) presents great opportunities and challenges for data mining. EMR data are largely temporal, often noisy, irregular and high dimensional. This paper constructs a novel ordinal regression framework for predicting medical risk stratification from EMR. First, a conceptual view of the EMR as a temporal image is constructed to extract a diverse set of features. Second, ordinal modeling is applied to predict cumulative or progressive risk. The challenges are building a transparent predictive model that works with a large number of weakly predictive features and, at the same time, is stable against resampling variations. Our solution employs sparsity methods that are stabilized through domain-specific feature interaction networks. We introduce two indices that measure model stability against data resampling. Feature networks are used to generate two multivariate Gaussian priors with sparse precision matrices (the Laplacian and Random Walk). We apply the framework to a large short-term suicide risk prediction problem and demonstrate that our methods outperform clinicians by a large margin, discover suicide risk factors that conform to mental health knowledge, and produce models with enhanced stability.
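Of the two priors named in the abstract, the Laplacian one is easy to sketch: the feature-interaction network's graph Laplacian serves as a sparse precision matrix, so the Gaussian prior penalizes differences between coefficients of connected features. The adjacency matrix and the small ridge term below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def laplacian_precision(adjacency, eps=1e-3):
    """Sparse precision matrix of a Gaussian prior from a feature network.

    For adjacency A of the feature-interaction graph, the Laplacian
    L = D - A acts as a (ridge-regularized) precision matrix, so the
    prior penalty on weights w is w^T (L + eps*I) w, which encourages
    connected features to take similar coefficients.
    """
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return L + eps * np.eye(len(A))

# Tiny 3-feature network: features 0 and 1 interact.
A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
P = laplacian_precision(A)
w = np.array([1.0, 1.0, 5.0])
print(w @ P @ w)   # small penalty: the connected coefficients agree
```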
IEEE Access
Due to the development of biomedical equipment and rising healthcare standards, especially in the Intensive Care Unit (ICU), a considerable amount of data has been collected for analysis. Mortality prediction in the ICU is considered one of the most important topics in healthcare data analysis. A precise prediction of the mortality risk for ICU patients could provide valuable information about patients' lives and reduce costs at the earliest possible stage. This paper introduces a new hybrid predictive model using the Genetic Algorithm as a feature selection method and a new ensemble classifier based on the combination of Stacking and Boosting ensemble methods to create an early mortality prediction model on a highly imbalanced dataset. The SVM-SMOTE method is used to solve the imbalanced data problem. This paper compares the new model with various machine learning models to validate its efficiency. The results achieved using shuffled 5-fold cross-validation and random hold-out methods indicate that the new hybrid model has the best performance among the classifiers. Additionally, the Friedman test is applied as a statistical significance test to examine the differences between classifiers. The results of the statistical analysis prove that the proposed model is more effective than the other classifiers. Furthermore, the proposed model is compared to the APACHE and SAPS scoring systems and is benchmarked against state-of-the-art predictive models on the MIMIC dataset for experimental validation, achieving promising results as it outperformed the state-of-the-art models.
Index Terms: classification, hybrid predictive model, stacking ensemble method, boosting ensemble method, intensive care unit (ICU), imbalanced data problem, machine learning in healthcare, SVM-SMOTE method, Friedman test.
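A rough sketch of the resampling-plus-ensembling core (omitting the paper's Genetic Algorithm feature selection and the MIMIC data) might combine imbalanced-learn's SVMSMOTE with a stacking classifier whose base learners include a boosting model; all estimator choices below are placeholders.

```python
from imblearn.over_sampling import SVMSMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in imbalanced data; the paper uses ICU records from MIMIC.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SVM-SMOTE oversamples the minority class near the SVM decision boundary.
# (For a faithful evaluation, resample inside each CV fold, not before.)
X_res, y_res = SVMSMOTE(random_state=0).fit_resample(X, y)

# Stacking with a boosting base learner, echoing the stacking+boosting hybrid.
stack = StackingClassifier(
    estimators=[("gb", GradientBoostingClassifier()),
                ("rf", RandomForestClassifier())],
    final_estimator=LogisticRegression(),
)
print(cross_val_score(stack, X_res, y_res, cv=5).mean())
```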
SSRN Electronic Journal, 2022
The Concordance Index (C-index) is a commonly used metric in survival analysis to evaluate how good a prediction model is. This paper proposes a decomposition of the C-index into a weighted harmonic mean of two quantities: one for ranking observed events versus other observed events, and the other for ranking observed events versus censored cases. This decomposition allows a more fine-grained analysis of the pros and cons of survival prediction methods. The utility of the decomposition is demonstrated using three benchmark survival analysis models (Cox Proportional Hazard, Random Survival Forest, and Deep Adversarial Time-to-Event Network) together with a new variational generative neural-network-based method (SurVED), which is also proposed in this paper. The demonstration is done on four publicly available datasets with varying censoring levels. The analysis with the C-index decomposition shows that all methods perform essentially equally well when the censoring level is high, because of the dominance of the term measuring the ranking of events versus censored cases. In contrast, some methods deteriorate when the censoring level decreases because they do not rank events versus other events well.
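The decomposition can be made concrete: splitting Harrell's comparable pairs into event-versus-event and event-versus-censored groups, the full C-index is recovered exactly as a harmonic mean of the two sub-indices weighted by their concordant-pair counts. This is our reading of the abstract; the paper's exact weights may be stated differently, and the variable names below are ours.

```python
import numpy as np

def cindex_decomposition(time, event, risk):
    """Split Harrell's C into event-vs-event and event-vs-censored parts.

    Returns (C_ee, C_ec, C), where C equals the harmonic mean of C_ee
    and C_ec weighted by their concordant-pair counts.
    """
    time, event, risk = map(np.asarray, (time, event, risk))
    num = {"ee": 0.0, "ec": 0.0}
    den = {"ee": 0.0, "ec": 0.0}
    for i in range(len(time)):
        if not event[i]:
            continue                          # i must be an observed event
        for j in range(len(time)):
            if time[j] > time[i]:             # comparable pair
                key = "ee" if event[j] else "ec"
                den[key] += 1
                num[key] += (risk[i] > risk[j]) + 0.5 * (risk[i] == risk[j])
    C_ee, C_ec = num["ee"] / den["ee"], num["ec"] / den["ec"]
    # Weighted harmonic mean reproduces the full C-index exactly:
    # (num_ee + num_ec) / (num_ee/C_ee + num_ec/C_ec) = num_total/den_total.
    C = (num["ee"] + num["ec"]) / (num["ee"] / C_ee + num["ec"] / C_ec)
    return C_ee, C_ec, C
```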
Diffsurv: Differentiable sorting for censored time-to-event data
arXiv (Cornell University), 2023
Survival analysis is a crucial semi-supervised task in machine learning with numerous real-world applications, particularly in healthcare. Currently, the most common approach to survival analysis is based on Cox's partial likelihood, which can be interpreted as a ranking model optimized on a lower bound of the concordance index. This relation between ranking models and Cox's partial likelihood considers only pairwise comparisons. Recent work has developed differentiable sorting methods which relax this pairwise independence assumption, enabling the ranking of sets of samples. However, current differentiable sorting methods cannot account for censoring, a key factor in many real-world datasets. To address this limitation, we propose a novel method called Diffsurv. We extend differentiable sorting methods to handle censored tasks by predicting matrices of possible permutations that take into account the label uncertainty introduced by censored samples. We contrast this approach with methods derived from partial likelihood and ranking losses. Our experiments show that Diffsurv outperforms established baselines in various simulated and real-world risk prediction scenarios. Additionally, we demonstrate the benefits of the algorithmic supervision enabled by Diffsurv by presenting a novel method for top-k risk prediction that outperforms current methods.
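Diffsurv's censoring-aware permutation matrices are more involved than a snippet allows, but the differentiable sorting relaxation this line of work builds on, a NeuralSort-style soft permutation matrix (Grover et al., 2019), can be sketched; the scores and temperature below are illustrative, and the censoring handling is omitted.

```python
import torch

def soft_sort_matrix(s, tau=1.0):
    """NeuralSort-style relaxed permutation matrix.

    Row i of the returned matrix is a soft indicator of which sample has
    the i-th largest score; the whole matrix is differentiable in s, so
    ranking losses defined on it can be trained by gradient descent.
    """
    n = s.shape[0]
    A = (s.unsqueeze(0) - s.unsqueeze(1)).abs()     # pairwise |s_i - s_j|
    B = A.sum(dim=1)                                # row sums of A
    c = n + 1 - 2 * torch.arange(1, n + 1, dtype=s.dtype)
    logits = (c.unsqueeze(1) * s.unsqueeze(0) - B.unsqueeze(0)) / tau
    return torch.softmax(logits, dim=1)

# Low temperature makes the rows nearly one-hot: sample 1, then 0, then 2.
scores = torch.tensor([0.2, 1.5, -0.3], requires_grad=True)
print(soft_sort_matrix(scores, tau=0.1))
```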