CatBoost Parameters and Hyperparameters
Last Updated : 3 Nov, 2025
CatBoost (Categorical Boosting) is a useful machine learning algorithm based on gradient boosting that can handle both numerical and categorical data efficiently. It builds a series of decision trees, where each new tree helps correct the mistakes made by the previous ones. This step-by-step improvement makes CatBoost highly accurate and reliable for different types of prediction tasks.
- Automatically handles categorical features without manual encoding.
- Reduces overfitting through regularization and ordered boosting.
- Supports fast CPU and GPU training.
- Often reaches strong accuracy with default settings and little parameter tuning.
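The sequential, error-correcting idea described above can be sketched in a few lines of plain Python. This is a generic gradient-boosting illustration with squared loss and one-split trees (stumps), not CatBoost's actual algorithm, which uses symmetric trees and ordered boosting. The function names and the toy data are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def boost(xs, ys, iterations, learning_rate):
    """Return the fitted stumps and the training MSE after each iteration."""
    predictions = [0.0] * len(xs)
    stumps, history = [], []
    for _ in range(iterations):
        # Each new stump is fitted to the errors left by the previous ones
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # The learning rate shrinks the contribution of every stump
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return stumps, history


# Toy data, invented for illustration
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
stumps, history = boost(xs, ys, iterations=20, learning_rate=0.3)
print(f"MSE after 1 tree: {history[0]:.3f}, after 20 trees: {history[-1]:.3f}")
```

The training error never increases from one iteration to the next, which is the step-by-step improvement the text refers to. The two arguments of `boost` play the same role as CatBoost's iterations and learning_rate settings.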
Let's understand CatBoost parameters and hyperparameter tuning.
1. CatBoost Parameters
Model parameters are the values CatBoost learns during training: the feature and threshold used at each split and the value stored in each leaf. The settings listed below are not learned. They are training parameters that you pass to the model before fitting, and they shape how the trees are built.
**Important Parameters:**
- **iterations**: Number of boosting iterations, which is the maximum number of trees built.
- **learning_rate**: Step size that scales each tree's contribution; it controls the speed and stability of convergence.
- **depth**: Depth of each decision tree; influences model complexity.
- **l2_leaf_reg**: Coefficient of the L2 regularization term on leaf values, which helps control overfitting.
- **cat_features**: Indices or names of the categorical columns, which CatBoost then encodes internally.
- **loss_function**: Objective function used for training (e.g., RMSE, Logloss, MultiClass).
2. CatBoost Hyperparameters
Hyperparameters are defined before training and govern how the algorithm behaves. Choosing the right combination of these directly affects model performance, generalization and training time.
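**Categories of Hyperparameters:**
- **Common**: General parameters like learning_rate, loss_function and random_seed.
- **Bootstrap**: Control data sampling for each tree (bootstrap_type, subsample).
- **Tree Structure**: Define tree complexity (depth, min_data_in_leaf, num_leaves). min_data_in_leaf applies to the Depthwise and Lossguide growing policies, and num_leaves to Lossguide only.
- **Quantization and Split Scoring**: Affect how numerical features are binned and how candidate splits are scored (feature_border_type, random_strength).
- **Regularization**: Penalize complexity (l2_leaf_reg). The related leaf_estimation_method chooses how leaf values are computed rather than penalizing them.
- **Overfitting Control**: Overfitting detection and early stopping (od_type, od_wait, early_stopping_rounds, use_best_model), measured on eval_metric.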
**Important Hyperparameters:**
- **learning_rate**: Smaller values often generalize better but need more iterations, which increases training time.
- **depth**: Controls model complexity; deeper trees capture more feature interactions but risk overfitting.
- **bagging_temperature**: Controls the randomness of the Bayesian bootstrap weights. A value of 0 gives every row the same weight, and higher values make the sampling more aggressive, which increases diversity between trees.
- **border_count**: Number of split candidates (borders) used to bin each numerical feature. Higher values allow finer splits but slow training.
- **l2_leaf_reg**: Adds an L2 penalty on leaf values to avoid large leaf weights and overfitting.
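To make the effect of bagging_temperature concrete, the sketch below samples row weights for a few temperature values. It assumes the weight rule described in CatBoost's documentation for the Bayesian bootstrap, w = (-log u)^t with u drawn uniformly from (0, 1]; the helper name `bayesian_bootstrap_weights` is made up for this illustration and is not part of the library.

```python
import math
import random
import statistics


def bayesian_bootstrap_weights(n, bagging_temperature, seed=0):
    """Sample n row weights as (-log u) ** t, an assumed form of the Bayesian bootstrap."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined
    return [(-math.log(1.0 - rng.random())) ** bagging_temperature
            for _ in range(n)]


for t in (0.0, 1.0, 5.0):
    weights = bayesian_bootstrap_weights(1000, t)
    print(f"t={t}: min={min(weights):.3f} "
          f"max={max(weights):.3f} stdev={statistics.pstdev(weights):.3f}")
```

At a temperature of 0 every weight is exactly 1, so all rows count equally. As the temperature grows, the weights spread out, a few rows dominate each tree, and the trees become more diverse.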
3. Hyperparameter Tuning
Hyperparameter tuning is the process of searching for the hyperparameter values that give the best score on held-out data, for example the highest validation accuracy or the lowest validation loss.
**Steps for Tuning:**
- **Define Search Space**: Specify possible ranges for parameters such as learning_rate ∈ [0.01, 0.3] or depth ∈ [4, 10].
- **Set Objective Function**: Choose a performance metric to optimize, like accuracy, AUC or RMSE, measured on validation data.
- **Choose Search Strategy**: Use techniques such as grid search, random search or Bayesian optimization (for example with Optuna).
- **Run and Evaluate**: Train multiple model configurations, compare performance and select the best-performing combination.
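The four steps above can be sketched as a small random search in plain Python. The search space matches the ranges mentioned in the list. The objective here is a stand-in toy function, invented so the example runs without any data; in practice it would train a CatBoost model with the sampled configuration and return its validation loss.

```python
import math
import random


def sample_config(rng):
    """Step 1: draw one configuration from the search space."""
    return {
        # learning_rate in [0.01, 0.3], sampled on a log scale
        "learning_rate": math.exp(rng.uniform(math.log(0.01), math.log(0.3))),
        # depth in [4, 10], integer
        "depth": rng.randint(4, 10),
    }


def toy_objective(config):
    """Step 2: a placeholder score (lower is better), not a real model."""
    return ((math.log10(config["learning_rate"]) + 1.3) ** 2
            + 0.01 * (config["depth"] - 6) ** 2)


def random_search(objective, n_trials, seed=0):
    """Steps 3 and 4: try random configurations and keep the best one."""
    rng = random.Random(seed)
    best_score, best_config = float("inf"), None
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        if score < best_score:
            best_score, best_config = score, config
    return best_score, best_config


best_score, best_config = random_search(toy_objective, n_trials=50)
print(best_score, best_config)
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude, which is a common choice for parameters that span a range like 0.01 to 0.3.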
Implementation
Step 1: Installation
```python
# Notebook syntax; from a terminal, run the command without the leading "!"
!pip install catboost
```
Step 2: Importing Libraries
We will import the necessary libraries: pandas, scikit-learn and catboost.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score
```
Step 3: Load and Prepare Data
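We will use the Iris dataset, read from a local iris.csv file in which the label column is named class.
```python
# Load the Iris data; the file is expected to have a 'class' label column
data = pd.read_csv("iris.csv")

# Separate the features from the target
X = data.drop('class', axis=1)
y = data['class']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```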
Step 4: Train CatBoost Model
We will train a multiclass classifier with 500 trees of depth 6.
```python
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    depth=6,
    loss_function='MultiClass',
    random_state=42,
    verbose=0,  # suppress per-iteration logging
)
model.fit(X_train, y_train)
```
Step 5: Evaluation
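We will evaluate the model on the held-out test set.
```python
# predict() returns a column of labels for MultiClass, so flatten it first
y_pred = model.predict(X_test).ravel()
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
```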
**Output:**
Accuracy: 100.00%
Step 6: Cross-Validation for Robust Evaluation
A single split of a small dataset gives a noisy accuracy estimate, so we also run 5-fold cross-validation on the full data.
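```python
# Wrap the full dataset so cv() can split it into folds
pool = Pool(X, label=y)
params = {
    'iterations': 1000,
    'learning_rate': 0.01,
    'depth': 3,
    'loss_function': 'MultiClass',
    'random_seed': 42,
}
# 5-fold cross-validation; return_models=True also returns the per-fold models
cv_results, cv_models = cv(pool=pool, params=params, fold_count=5,
                           verbose=False, return_models=True)
print(cv_results.head())
```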
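**Output:**
The first five rows of the results table: one row per iteration, with the columns iterations, test-MultiClass-mean, test-MultiClass-std, train-MultiClass-mean and train-MultiClass-std.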
Step 7: Mean Loss
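We will check the mean test loss across folds at the final iteration.
```python
# MultiClass loss is a log loss, not a percentage, so print the raw value
mean_loss = cv_results['test-MultiClass-mean'].iloc[-1]
print(f"Mean Loss: {mean_loss:.4f}")
```
**Output:**
Mean Loss: 0.1460
A mean log loss of about 0.146 is well below the ln(3) ≈ 1.099 that uniform guessing over three classes would give, so the model is working well.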
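To interpret the loss value, it helps to compute a multiclass log loss by hand. The sketch below uses made-up predicted probabilities for three samples; the function name `multiclass_logloss` is chosen for this example and is not a CatBoost API. It shows that the loss is the mean negative log-probability assigned to the true class, so exp(-loss) is the geometric mean of those probabilities. By the same relation, a loss of 0.146 corresponds to roughly 0.86.

```python
import math


def multiclass_logloss(probabilities, true_classes):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for probs, k in zip(probabilities, true_classes):
        total += -math.log(probs[k])
    return total / len(true_classes)


# Invented predictions for three samples over three classes
probs = [[0.9, 0.05, 0.05],
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6]]
labels = [0, 1, 2]

loss = multiclass_logloss(probs, labels)
print(f"loss = {loss:.4f}")
print(f"geometric-mean probability of the true class = {math.exp(-loss):.4f}")
```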
Advantages
- **Automatic categorical handling**: No need for manual one-hot or label encoding.
- **Reduced overfitting**: Uses Ordered Boosting and built-in regularization.
- **Fast training**: Optimized for both CPU and GPU.
- **Competitive accuracy**: Often matches or outperforms XGBoost and LightGBM, depending on the dataset.
- **User-friendly**: Simple API with built-in visualization and evaluation tools.
Limitations
- Can be slower than LightGBM on large datasets, partly because of its more complex categorical encoding.
- Higher memory usage when handling many categorical features.
- Fewer customization options for advanced users compared to XGBoost.
- Limited interoperability with some frameworks like TensorFlow/Keras.