Feature selection strategies for drug sensitivity prediction

Krzysztof Koras et al. Sci Rep. 2020.

Abstract

Drug sensitivity prediction constitutes one of the main challenges in personalized medicine. Critically, the sensitivity of cancer cells to treatment depends on an unknown subset of a large number of biological features. Here, we compare standard, data-driven feature selection approaches to feature selection driven by prior knowledge of drug targets, target pathways, and gene expression signatures. We assess these methodologies on the Genomics of Drug Sensitivity in Cancer (GDSC) dataset, evaluating 2484 unique models. For 23 drugs, better predictive performance is achieved when the features are selected according to prior knowledge of drug targets and pathways. The best correlation of observed and predicted response using the test set is achieved for Linifanib (r = 0.75). Extending the drug-dependent features with gene expression signatures yields the most predictive models for 60 drugs, with Dabrafenib as the best-performing example. For many compounds, even a very small subset of drug-related features is highly predictive of drug sensitivity. Small feature sets selected using prior knowledge are more predictive for drugs targeting specific genes and pathways, while models with wider feature sets perform better for drugs affecting general cellular mechanisms. Appropriate feature selection strategies facilitate the development of interpretable models that are indicative for therapy design.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1

Flowchart describing the modeling framework for a single compound. Abbreviations: GW – genome-wide, PG – pathway genes, OT – only targets, EN – elastic net, RF – random forest, SEL – automated feature selection, S – gene expression signatures. For every feature space, we performed modeling separately for each drug. We randomly split the corresponding data into training and test set, with 0.3 of the data included in the test set. We used 3-fold cross-validation on the training data for hyperparameter tuning and evaluated the best model on the test set. The whole modeling process was repeated five times with different training/test set data splits.
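The split, tune, and test loop described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses an elastic net as the only learner, and the function name `evaluate_drug`, the hyperparameter grid, and the synthetic data are all hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split


def evaluate_drug(X, y, n_repeats=5, test_size=0.3):
    """Repeat split / tune / test and return one test-set Pearson r per repeat."""
    correlations = []
    for seed in range(n_repeats):
        # A different random 70/30 split for every repetition.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        # 3-fold cross-validation on the training data only.
        # The grid values are illustrative assumptions.
        search = GridSearchCV(
            ElasticNet(max_iter=10000),
            param_grid={"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
            cv=3,
            scoring="neg_mean_squared_error",
        )
        search.fit(X_train, y_train)
        # The refitted best model is evaluated once on the held-out test set.
        y_pred = search.predict(X_test)
        correlations.append(float(np.corrcoef(y_test, y_pred)[0, 1]))
    return correlations


# Synthetic stand-in for one drug: 120 cell lines, 20 features (not GDSC data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=120)
scores = evaluate_drug(X, y)
```

Restricting `X` to target genes, pathway genes, or the full genome-wide matrix before calling such a function would correspond to the OT, PG, and GW feature spaces named in the caption.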

Figure 2

Model properties and response variable grouped by target pathways. (a) Number of input features across compounds in different methods. For genome-wide models, the number of features was 17737 for each drug. The vertical axis uses a log scale. (b) Number of samples across compounds in different methods. The abbreviation SS refers to stability selection (Methods). (c) AUC values grouped by target pathway of the drug, raw data from GDSC. Target pathways are sorted by the interquartile range of the AUC values. Pathways corresponding to more general cell mechanisms are marked with red dots. See Fig. 1 for abbreviations.

Figure 3

Predictive performance for all of the analyzed drugs. (a) 1 - RMSE versus correlation per drug, obtained by elastic net using genome-wide gene expression data as predictors. For 1-RMSE, higher values correspond to better performance. (b) Correlation versus standard deviation of true AUC for all cell lines screened for a given drug, correlation obtained by genome-wide elastic net. (c) RelRMSE versus correlation obtained by the best model for a given drug. Higher values of RelRMSE correspond to better performance and improvement over a dummy model, which predicts average AUC. Each point represents a single drug. For each of them, corresponding best performance was determined using correlation as a metric. Colors represent models with feature set that obtained the best performance for a given drug. Horizontal line at 1 represents the baseline RelRMSE score. Most of these correlations are statistically significant (test based on Student’s t-distribution at 0.05 significance level, Fig. S1). (d) Distribution of per-drug predictive performance grouped by per-drug number of available samples. Colors represent models with feature set that obtained the best performance for a given drug. See Fig. 1 for model abbreviations.
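The caption states only that RelRMSE is higher for better models and that 1 is the baseline set by a dummy model predicting the average AUC. One reading consistent with that description, and an assumption here rather than the paper's stated formula, is the ratio of the dummy model's RMSE to the model's RMSE. The helper names and the toy numbers below are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def rel_rmse(y_true, y_pred, mean_auc):
    """Assumed definition: RMSE of a constant mean-AUC predictor over model RMSE.

    Values above 1 mean the model beats the dummy; exactly 1 is the baseline.
    """
    dummy_pred = [mean_auc] * len(y_true)
    return rmse(y_true, dummy_pred) / rmse(y_true, y_pred)


# Toy values for illustration only (not results from the paper).
y_true = [0.9, 0.7, 0.5, 0.3]
y_pred = [0.8, 0.7, 0.6, 0.3]
score = rel_rmse(y_true, y_pred, mean_auc=0.6)
baseline = rel_rmse(y_true, [0.6] * 4, mean_auc=0.6)
```

Under this reading, a model that only reproduces the mean scores exactly at the horizontal baseline drawn in panel (c).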

Figure 4

Frequencies of all applied methods among best models per drug. (a) Correlation of AUC predictions with the true AUC values in the test set across compounds in methods with different feature spaces. Results are shown for the 175 drugs that were common across all applied models. (b) Model frequencies for compounds for which all methods were applied. (c) Differences in correlation between the best model per drug overall and the best model from the other class. Two cases are shown: genome-wide and biologically driven feature sets. (d) Model frequencies among best models for compounds to which models with biologically driven feature sets could not be applied. See Fig. 1 for abbreviations.

Figure 5

Predictive performance in relation to compounds’ target pathway. (a) Correlation with the test set grouped by pathways. Methods were classified into two groups – one that uses genome-wide feature space, and one with biologically driven feature space. Numbers displayed represent p-values for the one-sided Mann-Whitney-Wilcoxon test. Lack of number means no statistical significance at 0.05 significance level. (b) Predictive performance for drugs with DNA replication target pathway. (c) Predictive performance for drugs with RTK signaling pathway. See Fig. 1 for model abbreviations.
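A one-sided Mann-Whitney-Wilcoxon comparison of the kind named in panel (a) can be sketched with SciPy's `mannwhitneyu`. The two lists below are invented correlations for a single hypothetical pathway, and the direction of the alternative (biologically driven greater than genome-wide) is an assumption; the caption does not say which direction was tested.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-drug test-set correlations within one target pathway.
# These numbers are made up for illustration and are not from the paper.
biologically_driven = [0.55, 0.60, 0.62, 0.58, 0.65, 0.61]
genome_wide = [0.40, 0.45, 0.42, 0.50, 0.38, 0.47]

# One-sided test: are biologically driven correlations stochastically greater?
result = mannwhitneyu(biologically_driven, genome_wide, alternative="greater")
significant = result.pvalue < 0.05
```

In the figure, a p-value would be printed for a pathway only when `significant` holds at the 0.05 level.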

Figure 6

Frequencies of the considered feature types among the top k most predictive features. Feature importance coefficients were extracted for the top 50 drugs in terms of modeling performance, using methods with a biologically driven feature space.

Figure 7

Results for specific compounds that were modeled well by one or all of the methods. The displayed numbers give the number of features used by the best-performing model for a particular drug. The top horizontal axis shows the compounds' target pathways along with the model that achieved the best result. See Fig. 1 for model abbreviations.

Figure 8

Predicted versus actual AUC values and most predictive features for (a) Dabrafenib, (b) Linifanib and (c) Quizartinib. Top panels show predicted versus actual AUC values when both biologically driven and genome-wide models were trained and tested on the same sets of samples. The biologically driven models correspond to best suited feature set for each drug: OT + S RF for Dabrafenib, OT RF for Linifanib and PG RF for Quizartinib. Middle and bottom panels present top 5 most informative features when fitting the model with genome-wide data (middle) and biologically driven feature space (bottom).
