Random Forests(TM) in XGBoost

XGBoost is normally used to train gradient-boosted decision trees and other gradient-boosted models. Random forests use the same model representation and inference as gradient-boosted decision trees, but a different training algorithm. One can use XGBoost to train a standalone random forest, or use a random forest as the base model for gradient boosting. Here we focus on training standalone random forests.

We have provided native APIs for training random forests since the early days, and a new Scikit-Learn wrapper was added after 0.82 (it is not included in 0.82). Please note that the new Scikit-Learn wrapper is still experimental, which means we might change the interface whenever needed.

Standalone Random Forest With XGBoost API

The following parameters must be set to enable random forest training.

- booster should be set to gbtree, as we are training forests. Note that as this is the default, this parameter needn't be set explicitly.
- subsample must be set to a value less than 1 to enable random selection of training cases (rows).
- One of the colsample_by* parameters must be set to a value less than 1 to enable random selection of columns. Normally, colsample_bynode would be set to a value less than 1 to randomly sample columns at each tree split.
- num_parallel_tree should be set to the size of the forest being trained.
- num_boost_round should be set to 1 to prevent XGBoost from boosting multiple random forests. Note that this is a keyword argument to train(), and is not part of the parameter dictionary.
- eta (alias: learning_rate) must be set to 1 when training random forest regression.
- random_state can be used to seed the random number generator.

Other parameters should be set in a similar way as they are for gradient boosting. For instance, objective will typically be reg:squarederror for regression and binary:logistic for classification, lambda should be set according to the desired regularization weight, etc.

If both num_parallel_tree and num_boost_round are greater than 1, training will use a combination of the random forest and gradient boosting strategies. It will perform num_boost_round rounds, boosting a random forest of num_parallel_tree trees at each round. If early stopping is not enabled, the final model will consist of num_parallel_tree * num_boost_round trees.
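To make the combination concrete, here is a minimal sketch (the synthetic data and the specific parameter values are illustrative assumptions, not from this page): 5 boosting rounds over 20-tree forests produce a model with 20 * 5 = 100 trees.

import numpy as np
import xgboost

# Illustrative synthetic data; any DMatrix works here.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = (X[:, 0] > 0).astype(int)
dmatrix = xgboost.DMatrix(X, label=y)

params = {
    "num_parallel_tree": 20,   # each round boosts a 20-tree forest
    "subsample": 0.8,
    "colsample_bynode": 0.8,
    "objective": "binary:logistic",
}
bst = xgboost.train(params, dmatrix, num_boost_round=5)
# 5 rounds x 20 parallel trees = 100 trees in the final model
print(len(bst.get_dump()))  # 100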

Here is a sample parameter dictionary for training a random forest on a GPU using xgboost:

params = {
    "colsample_bynode": 0.8,
    "learning_rate": 1,
    "max_depth": 5,
    "num_parallel_tree": 100,
    "objective": "binary:logistic",
    "subsample": 0.8,
    "tree_method": "hist",
    "device": "cuda",
}

A random forest model can then be trained as follows:

# dmatrix is an xgboost.DMatrix containing the training data
bst = xgboost.train(params, dmatrix, num_boost_round=1)
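As a self-contained usage sketch that runs end to end on CPU (the synthetic data is an assumption; add "device": "cuda" back to the dictionary for GPU training):

import numpy as np
import xgboost

rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
dmatrix = xgboost.DMatrix(X, label=y)

params = {
    "colsample_bynode": 0.8,
    "learning_rate": 1,
    "max_depth": 5,
    "num_parallel_tree": 100,
    "objective": "binary:logistic",
    "subsample": 0.8,
    "tree_method": "hist",
}

# One boosting round trains the whole 100-tree forest.
bst = xgboost.train(params, dmatrix, num_boost_round=1)
preds = bst.predict(dmatrix)  # predicted probabilities under binary:logistic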

Standalone Random Forest With Scikit-Learn-Like API

XGBRFClassifier and XGBRFRegressor are Scikit-Learn-like classes that provide random forest functionality. They are basically versions of XGBClassifier and XGBRegressor that train a random forest instead of gradient boosting, with the default values and meaning of some of the parameters adjusted accordingly. In particular:

- n_estimators specifies the size of the forest to be trained; it is converted to num_parallel_tree, instead of the number of boosting rounds
- learning_rate is set to 1 by default
- colsample_bynode and subsample are set to 0.8 by default
- booster is always gbtree

For a simple example, you can train a random forest regressor with:

from sklearn.model_selection import KFold
import xgboost as xgb

# Your code ...

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X, y):
    xgb_model = xgb.XGBRFRegressor(random_state=42).fit(
        X[train_index], y[train_index])
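Extending the loop above into a self-contained sketch (the make_regression dataset and the mean_squared_error metric are assumptions added for illustration) that also scores each held-out fold:

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import xgboost as xgb

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    xgb_model = xgb.XGBRFRegressor(random_state=42).fit(
        X[train_index], y[train_index])
    # Evaluate on the held-out fold.
    preds = xgb_model.predict(X[test_index])
    print(mean_squared_error(y[test_index], preds))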

Note that these classes have a smaller selection of parameters compared to using train(). In particular, it is impossible to combine random forests with gradient boosting using this API.
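One way to see this constraint is to count the trees in a fitted model (a sketch; it assumes each entry in the booster's dump corresponds to one tree). Since n_estimators becomes num_parallel_tree and only a single boosting round is performed, fitting with n_estimators=50 yields exactly 50 trees:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(100)

reg = xgb.XGBRFRegressor(n_estimators=50).fit(X, y)
# One boosting round of a 50-tree forest: 50 trees in the dump.
print(len(reg.get_booster().get_dump()))  # 50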

Caveats

- XGBoost uses an approximation to the objective based on second-order gradients, which can lead to results that differ from a random forest implementation that uses the exact value of the objective.
- XGBoost does not perform replacement when subsampling training cases: each training case can occur in a subsampled set either 0 or 1 time.