Shadow Variable Search on the Pima Indian Diabetes Data Set – mlr-org

Scope

Feature selection is the process of finding an optimal set of features to improve the performance, interpretability and robustness of machine learning algorithms. In this article, we introduce the Shadow Variable Search algorithm, which is a wrapper method for feature selection. Wrapper methods evaluate candidate feature subsets by training a model on them; forward-style wrappers iteratively add the feature that improves a performance measure the most. As an example, we will search for the optimal set of features for a support vector machine on the Pima Indian Diabetes data set. We assume that you are already familiar with the basic building blocks of the mlr3 ecosystem. If you are new to feature selection, we recommend reading the feature selection chapter of the mlr3book first. Some knowledge about mlr3pipelines is beneficial but not necessary to understand the example.

Adding shadow variables to a data set is a well-known method in machine learning (Wu, Boos, and Stefanski 2007; Thomas et al. 2017). The idea is to add permuted copies of the original features to the data set. These permuted copies are called shadow variables or pseudovariables, and the permutation breaks any relationship with the target variable, making them useless for prediction. The subsequent search is similar to the sequential forward selection algorithm, where one new feature is added in each iteration. This new feature is selected as the one that improves the performance of the model the most. This selection is computationally expensive, as one model has to be trained for each feature that is not yet included. The difference between shadow variable search and sequential forward selection is that the former uses the selection of a shadow variable as the termination criterion. Selecting a shadow variable means that the best improvement is achieved by adding a feature that is unrelated to the target variable. Consequently, the variables not yet selected are most likely correlated with the target variable only by chance, and only the previously selected features are assumed to have a true influence on the target variable.
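
As an illustration of the idea (this is not the mlr3fselect implementation), the following sketch appends a permuted copy of each feature to a data.table, using the permuted__ prefix that also appears in the archive shown later in this article. The helper name add_shadow_variables is made up for this example.

library(data.table)

# Illustrative only: append a permuted (shadow) copy of each feature
add_shadow_variables = function(dt, feature_names) {
  shadows = dt[, lapply(.SD, sample), .SDcols = feature_names]
  setnames(shadows, paste0("permuted__", feature_names))
  cbind(dt, shadows)
}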

mlr3fselect is the feature selection package of the mlr3 ecosystem. It implements the shadow variable search algorithm. We load all packages of the ecosystem with the mlr3verse package.
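
library(mlr3verse)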

We retrieve the shadow variable search optimizer with the fs() function. The algorithm has no control parameters.

optimizer = fs("shadow_variable_search")

Task and Learner

The Pima Indian Diabetes data set poses a binary classification problem: predicting whether a person has diabetes. The data set includes 768 patients with 8 measurements (see Figure 1).
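
We retrieve the task from the mlr3 task dictionary.

task = tsk("pima")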


library(ggplot2)
library(data.table)

data = melt(as.data.table(task), id.vars = task$target_names, measure.vars = task$feature_names)

ggplot(data, aes(x = value, fill = diabetes)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ variable, ncol = 8, scales = "free") +
  scale_fill_viridis_d(end = 0.8) +
  theme_minimal() +
  theme(axis.title.x = element_blank())

Figure 1: Distribution of the features in the Pima Indian Diabetes data set.

The data set contains missing values, which we can count with the $missings() method of the task.
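
task$missings()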

diabetes      age  glucose  insulin     mass pedigree pregnant pressure  triceps 
       0        0        5      374       11        0        0       35      227 

Support vector machines cannot handle missing values. We therefore impute the missing values with the histogram imputation method and combine the imputation step and the support vector machine into a single graph learner.

learner = po("imputehist") %>>% lrn("classif.svm", predict_type = "prob")

Feature Selection

Now we define the feature selection problem by using the fsi() function that constructs an FSelectInstanceBatchSingleCrit. In addition to the task and learner, we have to select a resampling strategy and performance measure to determine how the performance of a feature subset is evaluated. We pass the "none" terminator because the shadow variable search algorithm terminates by itself.

instance = fsi(
  task = task,
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.auc"),
  terminator = trm("none")
)

We are now ready to start the shadow variable search. To do this, we simply pass the instance to the $optimize() method of the optimizer.

optimizer$optimize(instance)
      age glucose insulin   mass pedigree pregnant pressure triceps                  features n_features classif.auc
   <lgcl>  <lgcl>  <lgcl> <lgcl>   <lgcl>   <lgcl>   <lgcl>  <lgcl>                    <list>      <int>       <num>
1:   TRUE    TRUE   FALSE   TRUE     TRUE    FALSE    FALSE   FALSE age,glucose,mass,pedigree          4    0.835165

The optimizer returns the best feature set and the corresponding estimated performance.
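
The best feature set and its estimated performance can also be accessed directly from the instance:

instance$result_feature_set
instance$result_y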

Figure 2 shows the optimization path of the feature selection. The feature glucose was selected first, followed by age, mass and pedigree in the subsequent iterations. Then a shadow variable was selected and the feature selection terminated.


library(data.table)
library(ggplot2)
library(mlr3misc)
library(viridisLite)

data = as.data.table(instance$archive)[order(-classif.auc), head(.SD, 1), by = batch_nr][order(batch_nr)]
data[, features := map_chr(features, str_collapse)]
data[, batch_nr := as.character(batch_nr)]

ggplot(data, aes(x = batch_nr, y = classif.auc)) +
  geom_bar(
    stat = "identity",
    width = 0.5,
    fill = viridis(1, begin = 0.5),
    alpha = 0.8) +
  geom_text(
    data = data,
    mapping = aes(x = batch_nr, y = 0, label = features),
    hjust = 0,
    nudge_y = 0.05,
    color = "white",
    size = 5
    ) +
  coord_flip() +
  xlab("Iteration") +
  theme_minimal()

Figure 2: Optimization path of the shadow variable search.

The archive contains all evaluated feature sets. We can see that each feature has a corresponding shadow variable. We only show the variables age, glucose and insulin and their shadow variables here.

as.data.table(instance$archive)[, .(age, glucose, insulin, permuted__age, permuted__glucose, permuted__insulin, classif.auc)]
       age glucose insulin permuted__age permuted__glucose permuted__insulin classif.auc
    <lgcl>  <lgcl>  <lgcl>        <lgcl>            <lgcl>            <lgcl>       <num>
 1:   TRUE   FALSE   FALSE         FALSE             FALSE             FALSE   0.6437052
 2:  FALSE    TRUE   FALSE         FALSE             FALSE             FALSE   0.7598155
 3:  FALSE   FALSE    TRUE         FALSE             FALSE             FALSE   0.4900280
 4:  FALSE   FALSE   FALSE         FALSE             FALSE             FALSE   0.6424026
 5:  FALSE   FALSE   FALSE         FALSE             FALSE             FALSE   0.5690107
---                                                                                     
54:   TRUE    TRUE   FALSE         FALSE             FALSE             FALSE   0.8266713
55:   TRUE    TRUE   FALSE         FALSE             FALSE             FALSE   0.8063568
56:   TRUE    TRUE   FALSE         FALSE             FALSE             FALSE   0.8244232
57:   TRUE    TRUE   FALSE         FALSE             FALSE             FALSE   0.8234605
58:   TRUE    TRUE   FALSE         FALSE             FALSE             FALSE   0.8164784

Final Model

The learner we use to make predictions on new data is called the final model. The final model is trained with the optimal feature set on the full data set. We subset the task to the optimal feature set and train the learner.

task$select(instance$result_feature_set)
learner$train(task)

The trained model can now be used to predict new, external data.
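
As a minimal sketch of this step, we reuse a few rows of the task as stand-in "new" observations; in practice, this would be an external data set with the same feature columns.

newdata = task$data(rows = 1:5, cols = task$feature_names)
learner$predict_newdata(newdata)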

Conclusion

The shadow variable search is a fast feature selection method that is easy to use. More information on the theoretical background can be found in Wu, Boos, and Stefanski (2007) and Thomas et al. (2017). If you want to know more about feature selection in general, we recommend having a look at our book.

Session Information

sessioninfo::session_info(info = "packages")
═ Session info ═══════════════════════════════════════════════════════════════════════════════════════════════════════
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 ! package           * version    date (UTC) lib source
   backports           1.5.0      2024-05-23 [1] CRAN (R 4.4.1)
   bbotk               1.1.1      2024-10-15 [1] CRAN (R 4.4.1)
   checkmate           2.3.2      2024-07-29 [1] CRAN (R 4.4.1)
 P class               7.3-22     2023-05-03 [?] CRAN (R 4.4.0)
   cli                 3.6.3      2024-06-21 [1] CRAN (R 4.4.1)
   clue                0.3-65     2023-09-23 [1] CRAN (R 4.4.1)
 P cluster             2.1.6      2023-12-01 [?] CRAN (R 4.4.0)
 P codetools           0.2-20     2024-03-31 [?] CRAN (R 4.4.0)
   colorspace          2.1-1      2024-07-26 [1] CRAN (R 4.4.1)
   crayon              1.5.3      2024-06-20 [1] CRAN (R 4.4.1)
   data.table        * 1.16.2     2024-10-10 [1] CRAN (R 4.4.1)
   DEoptimR            1.1-3      2023-10-07 [1] CRAN (R 4.4.1)
   digest              0.6.37     2024-08-19 [1] CRAN (R 4.4.1)
   diptest             0.77-1     2024-04-10 [1] CRAN (R 4.4.1)
   dplyr               1.1.4      2023-11-17 [1] CRAN (R 4.4.1)
   e1071               1.7-16     2024-09-16 [1] CRAN (R 4.4.1)
   evaluate            1.0.1      2024-10-10 [1] CRAN (R 4.4.1)
   fansi               1.0.6      2023-12-08 [1] CRAN (R 4.4.1)
   farver              2.1.2      2024-05-13 [1] CRAN (R 4.4.1)
   fastmap             1.2.0      2024-05-15 [1] CRAN (R 4.4.1)
   flexmix             2.3-19     2023-03-16 [1] CRAN (R 4.4.1)
   fpc                 2.2-13     2024-09-24 [1] CRAN (R 4.4.1)
   future              1.34.0     2024-07-29 [1] CRAN (R 4.4.1)
   future.apply        1.11.2     2024-03-28 [1] CRAN (R 4.4.1)
   generics            0.1.3      2022-07-05 [1] CRAN (R 4.4.1)
   ggplot2           * 3.5.1      2024-04-23 [1] CRAN (R 4.4.1)
   globals             0.16.3     2024-03-08 [1] CRAN (R 4.4.1)
   glue                1.8.0      2024-09-30 [1] CRAN (R 4.4.1)
   gtable              0.3.5      2024-04-22 [1] CRAN (R 4.4.1)
   htmltools           0.5.8.1    2024-04-04 [1] CRAN (R 4.4.1)
   htmlwidgets         1.6.4      2023-12-06 [1] CRAN (R 4.4.1)
   jsonlite            1.8.9      2024-09-20 [1] CRAN (R 4.4.1)
   kernlab             0.9-33     2024-08-13 [1] CRAN (R 4.4.1)
   knitr               1.48       2024-07-07 [1] CRAN (R 4.4.1)
   labeling            0.4.3      2023-08-29 [1] CRAN (R 4.4.1)
 P lattice             0.22-5     2023-10-24 [?] CRAN (R 4.3.3)
   lgr                 0.4.4      2022-09-05 [1] CRAN (R 4.4.1)
   lifecycle           1.0.4      2023-11-07 [1] CRAN (R 4.4.1)
   listenv             0.9.1      2024-01-29 [1] CRAN (R 4.4.1)
   magrittr            2.0.3      2022-03-30 [1] CRAN (R 4.4.1)
 P MASS                7.3-61     2024-06-13 [?] CRAN (R 4.4.1)
   mclust              6.1.1      2024-04-29 [1] CRAN (R 4.4.1)
   mlr3              * 0.21.1     2024-10-18 [1] CRAN (R 4.4.1)
   mlr3cluster         0.1.10     2024-10-03 [1] CRAN (R 4.4.1)
   mlr3data            0.7.0      2023-06-29 [1] CRAN (R 4.4.1)
   mlr3extralearners   0.9.0-9000 2024-10-18 [1] Github (mlr-org/mlr3extralearners@a622524)
   mlr3filters         0.8.0      2024-04-10 [1] CRAN (R 4.4.1)
   mlr3fselect         1.1.1.9000 2024-10-18 [1] Github (mlr-org/mlr3fselect@e917a02)
   mlr3hyperband       0.6.0      2024-06-29 [1] CRAN (R 4.4.1)
   mlr3learners        0.7.0      2024-06-28 [1] CRAN (R 4.4.1)
   mlr3mbo             0.2.6      2024-10-16 [1] CRAN (R 4.4.1)
   mlr3measures        1.0.0      2024-09-11 [1] CRAN (R 4.4.1)
   mlr3misc          * 0.15.1     2024-06-24 [1] CRAN (R 4.4.1)
   mlr3pipelines       0.7.0      2024-09-24 [1] CRAN (R 4.4.1)
   mlr3tuning          1.0.2      2024-10-14 [1] CRAN (R 4.4.1)
   mlr3tuningspaces    0.5.1      2024-06-21 [1] CRAN (R 4.4.1)
   mlr3verse         * 0.3.0      2024-06-30 [1] CRAN (R 4.4.1)
   mlr3viz             0.9.0      2024-07-01 [1] CRAN (R 4.4.1)
   mlr3website       * 0.0.0.9000 2024-10-18 [1] Github (mlr-org/mlr3website@20d1ddf)
   modeltools          0.2-23     2020-03-05 [1] CRAN (R 4.4.1)
   munsell             0.5.1      2024-04-01 [1] CRAN (R 4.4.1)
 P nnet                7.3-19     2023-05-03 [?] CRAN (R 4.3.3)
   palmerpenguins      0.1.1      2022-08-15 [1] CRAN (R 4.4.1)
   paradox             1.0.1      2024-07-09 [1] CRAN (R 4.4.1)
   parallelly          1.38.0     2024-07-27 [1] CRAN (R 4.4.1)
   pillar              1.9.0      2023-03-22 [1] CRAN (R 4.4.1)
   pkgconfig           2.0.3      2019-09-22 [1] CRAN (R 4.4.1)
   prabclus            2.3-4      2024-09-24 [1] CRAN (R 4.4.1)
   proxy               0.4-27     2022-06-09 [1] CRAN (R 4.4.1)
   R6                  2.5.1      2021-08-19 [1] CRAN (R 4.4.1)
   Rcpp                1.0.13     2024-07-17 [1] CRAN (R 4.4.1)
   renv                1.0.11     2024-10-12 [1] CRAN (R 4.4.1)
   rlang               1.1.4      2024-06-04 [1] CRAN (R 4.4.1)
   rmarkdown           2.28       2024-08-17 [1] CRAN (R 4.4.1)
   robustbase          0.99-4-1   2024-09-27 [1] CRAN (R 4.4.1)
   scales              1.3.0      2023-11-28 [1] CRAN (R 4.4.1)
   sessioninfo         1.2.2      2021-12-06 [1] CRAN (R 4.4.1)
   spacefillr          0.3.3      2024-05-22 [1] CRAN (R 4.4.1)
   stringi             1.8.4      2024-05-06 [1] CRAN (R 4.4.1)
   tibble              3.2.1      2023-03-20 [1] CRAN (R 4.4.1)
   tidyselect          1.2.1      2024-03-11 [1] CRAN (R 4.4.1)
   utf8                1.2.4      2023-10-22 [1] CRAN (R 4.4.1)
   uuid                1.2-1      2024-07-29 [1] CRAN (R 4.4.1)
   vctrs               0.6.5      2023-12-01 [1] CRAN (R 4.4.1)
   viridisLite       * 0.4.2      2023-05-02 [1] CRAN (R 4.4.1)
   withr               3.0.1      2024-07-31 [1] CRAN (R 4.4.1)
   xfun                0.48       2024-10-03 [1] CRAN (R 4.4.1)
   yaml                2.3.10     2024-07-26 [1] CRAN (R 4.4.1)

 [1] /home/marc/repositories/mlr3website/mlr-org/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu
 [2] /home/marc/.cache/R/renv/sandbox/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu/9a444a72

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

References

Thomas, Janek, Tobias Hepp, Andreas Mayr, and Bernd Bischl. 2017. “Probing for Sparse and Fast Variable Selection with Model-Based Boosting.” Computational and Mathematical Methods in Medicine 2017 (July): e1421409. https://doi.org/10.1155/2017/1421409.

Wu, Yujun, Dennis D Boos, and Leonard A Stefanski. 2007. “Controlling Variable Selection by the Addition of Pseudovariables.” Journal of the American Statistical Association 102 (477): 235–43. https://doi.org/10.1198/016214506000000843.