Performing Feature Selection with GridSearchCV in Sklearn

Last Updated : 23 Jul, 2025

Feature selection is a crucial step in machine learning, as it helps to identify the most relevant features in a dataset that contribute to the model's performance. One effective way to perform feature selection is by combining it with hyperparameter tuning using GridSearchCV from scikit-learn. In this article, we will delve into the details of how to perform feature selection with GridSearchCV in Python.


**Introduction to Feature Selection and Techniques**

Feature selection is the process of selecting a subset of relevant features for use in model construction. Its main benefits typically include reduced overfitting, shorter training times, and simpler models that are easier to interpret.

There are several feature selection techniques available in scikit-learn, including:

  1. **Recursive Feature Elimination (RFE):** This method repeatedly fits an estimator and removes the least important features until a specified number of features remains. It wraps a classifier or regressor that exposes feature weights or importances.
  2. **SelectKBest:** This method selects the top k features according to a univariate scoring function, such as mutual information or the ANOVA F-score.
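The two techniques above can be sketched side by side. This is a minimal illustration, assuming a small synthetic dataset from make_classification and a decision tree as the estimator inside RFE; both choices are stand-ins picked to keep the example fast, not part of the walkthrough later in the article.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 20 features, 5 informative
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

# RFE: refit the tree repeatedly, dropping one feature per iteration
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=5, step=1)
X_rfe = rfe.fit_transform(X, y)

# SelectKBest: score each feature independently with the ANOVA F-test
skb = SelectKBest(score_func=f_classif, k=5)
X_skb = skb.fit_transform(X, y)

print("RFE kept columns:        ", rfe.get_support(indices=True))
print("SelectKBest kept columns:", skb.get_support(indices=True))
print("Reduced shapes:", X_rfe.shape, X_skb.shape)
```

The two selectors need not agree: RFE judges features in the context of a model, while SelectKBest scores each feature on its own.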

**Understanding GridSearchCV**

GridSearchCV is a scikit-learn tool that performs an exhaustive search over specified parameter values for an estimator. It is particularly useful for hyperparameter tuning, where the goal is to find the combination of parameters that results in the best model performance. Its key components are the estimator to tune, a parameter grid (param_grid), a scoring metric (scoring), and a cross-validation strategy (cv): every combination in the grid is evaluated with cross-validation using the chosen metric, and the best-scoring combination is reported.
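As a minimal sketch of these components, the snippet below tunes a single decision tree on the digits dataset. The estimator, the small grid, and the 3-fold accuracy scoring are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Parameter grid: 2 x 2 = 4 candidate combinations
param_grid = {"max_depth": [4, 8], "criterion": ["gini", "entropy"]}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,                 # 3-fold cross-validation per candidate
    scoring="accuracy",
)
search.fit(X, y)

print("Candidates evaluated:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```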

**Practical Example: Feature Selection with GridSearchCV**

To combine feature selection with hyperparameter tuning, we can use the Pipeline class in Scikit-Learn. A pipeline allows us to assemble several steps that can be cross-validated together while setting different parameters. This ensures that all steps run in sequence and that the feature selector is fitted only on the training portion of each cross-validation fold, which prevents information from the validation data leaking into the selection.
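One way to see this in miniature is to tune the number of selected features together with a model hyperparameter. The sketch below assumes synthetic binary data, a SelectKBest selector, and logistic regression; the step names select and clf are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic binary data (an assumption for illustration)
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=6, random_state=0
)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {
    "select__k": [5, 10, 20],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(5), scoring="roc_auc")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV ROC AUC:", search.best_score_)
```

Because the selector lives inside the pipeline, it is refitted on the training part of every fold, so the number of features is tuned like any other hyperparameter.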

Let's walk through an example of performing feature selection with GridSearchCV using a Random Forest classifier.

**Step 1: Import Libraries**

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
```

**Step 2: Load and Prepare Data**

```python
# Load dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```

**Step 3: Define the Pipeline**

```python
# Define the classifier
clf = RandomForestClassifier(random_state=42, class_weight="balanced")

# Define the feature selector. The digits target is multiclass, so use the
# one-vs-rest scorer; plain 'roc_auc' raises an error on multiclass targets.
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc_ovr')

# Create a pipeline
pipeline = Pipeline([
    ('feature_selection', rfecv),
    ('classification', clf)
])
```

**Step 4: Define the Parameter Grid**

```python
# Define the parameter grid.
# 'auto' is omitted from max_features: it was removed in scikit-learn 1.3
# and was equivalent to 'sqrt' for classifiers.
param_grid = {
    'classification__n_estimators': [200, 500],
    'classification__max_features': ['sqrt', 'log2'],
    'classification__max_depth': [4, 5, 6, 7, 8],
    'classification__criterion': ['gini', 'entropy']
}
```

**Step 5: Perform Grid Search with Cross-Validation**

```python
# Define the GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=StratifiedKFold(10),
    scoring='roc_auc_ovr',
    n_jobs=-1
)

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)
```

Output:

Best parameters found: {'classification__criterion': 'entropy', 'classification__max_depth': 7, 'classification__max_features': 'sqrt', 'classification__n_estimators': 500}

Best cross-validation score: 0.980

**Step 6: Evaluate the Model on Test Data**

```python
# Predict on the test set
y_pred = grid_search.predict(X_test)
y_pred_proba = grid_search.predict_proba(X_test)

# Evaluate the model
roc_auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
print("ROC AUC Score on test data: ", roc_auc)
```

Output:

ROC AUC Score on test data: 0.976
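After fitting, it is often useful to check which features the selector actually kept. The sketch below is a lightweight stand-in, not the pipeline from the steps above: it assumes RFE with a decision tree and a two-value grid so that it runs in seconds. The same access pattern, best_estimator_.named_steps['feature_selection'].support_, should also apply to an RFECV step, since RFECV exposes a support_ mask as well.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Stand-in pipeline (assumption): RFE + decision tree instead of RFECV + forest
pipeline = Pipeline([
    ("feature_selection", RFE(DecisionTreeClassifier(random_state=42), step=8)),
    ("classification", DecisionTreeClassifier(random_state=42)),
])
param_grid = {"feature_selection__n_features_to_select": [16, 32]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Boolean mask over the 64 pixel features of the refitted best pipeline
selector = search.best_estimator_.named_steps["feature_selection"]
mask = selector.support_

print("Features kept:", int(mask.sum()), "of", mask.size)
print(mask.reshape(8, 8).astype(int))  # 1 = pixel kept, laid out as the 8x8 image
print("Test accuracy:", search.score(X_test, y_test))
```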

**Best Practices and Tips**

  1. **Use Pipelines:** Put feature selection and the model in one pipeline so that selection is refitted inside every cross-validation fold and never sees the validation data.
  2. **Cross-Validation:** Use cross-validation to get a more reliable estimate of model performance and to detect overfitting.
  3. **Scoring Metrics:** Choose scoring metrics that match the problem at hand (e.g., roc_auc for binary classification, roc_auc_ovr for multiclass classification).
  4. **Parameter Grid Size:** Be mindful of the size of the parameter grid. Every added value multiplies the number of candidates, and each candidate is fitted once per fold, so a large grid can significantly increase computation time.
  5. **Feature Selection Methods:** Experiment with different feature selection methods (e.g., SelectKBest, RFECV) to find the most effective one for your data.
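To make the cost of a grid concrete before running it, the number of candidates can be counted with ParameterGrid. The grid below is an illustrative assumption modeled on the random forest parameters used in this article, with 10-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid (assumption): 2 x 5 x 2 parameter values
param_grid = {
    "classification__n_estimators": [200, 500],
    "classification__max_depth": [4, 5, 6, 7, 8],
    "classification__criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(param_grid))  # number of parameter combinations
n_splits = 10                                  # folds used by the grid search
n_fits = n_candidates * n_splits               # cross-validation fits of the pipeline

print("Candidates:", n_candidates)  # 20
print("CV fits:", n_fits)           # 200
```

When the pipeline contains a cross-validated selector such as RFECV, each of these fits runs its own inner elimination loop, so the real cost is considerably higher than the count suggests.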

**Conclusion**

Combining feature selection with hyperparameter tuning using GridSearchCV in Scikit-Learn is a powerful technique to improve model performance and efficiency. By using pipelines, we can ensure that all steps are performed correctly and sequentially, leading to more robust and reliable models. This guide provides a comprehensive overview and practical example to help you get started with feature selection and hyperparameter tuning in Python.