GitHub - tidymodels/parsnip: A tidy unified interface to models

parsnip

Introduction

The goal of parsnip is to provide a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.

Installation

The easiest way to get parsnip is to install all of tidymodels:

install.packages("tidymodels")

Alternatively, install just parsnip:

install.packages("parsnip")

Or the development version from GitHub:

install.packages("pak")

pak::pak("tidymodels/parsnip")

Getting started

One challenge with different modeling functions available in R that do the same thing is that they can have different interfaces and arguments. For example, to fit a random forest regression model, we might have:

From randomForest

rf_1 <- randomForest(
  y ~ .,
  data = dat,
  mtry = 10,
  ntree = 2000,
  importance = TRUE
)

From ranger

rf_2 <- ranger(
  y ~ .,
  data = dat,
  mtry = 10,
  num.trees = 2000,
  importance = "impurity"
)

From sparklyr

rf_3 <- ml_random_forest(
  dat,
  intercept = FALSE,
  response = "y",
  features = names(dat)[names(dat) != "y"],
  col.sample.rate = 10,
  num.trees = 2000
)

Note that the model syntax can be very different and that the argument names (and formats) are also different. This is a pain if you switch between implementations.

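In this example, the type of model is "random forest", the mode of the model is "regression" (as opposed to classification, etc.), and the computational engine is the name of the R package.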

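The goals of parsnip are to separate the definition of a model from its evaluation, to decouple the model specification from the implementation (whether the implementation is in R, Spark, or something else), and to harmonize argument names (e.g. n.trees, ntrees, trees) so that users only need to remember a single name.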

Using the example above, the parsnip approach would be:

library(parsnip)

rand_forest(mtry = 10, trees = 2000) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("regression")
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#>   mtry = 10
#>   trees = 2000
#>
#> Engine-Specific Arguments:
#>   importance = impurity
#>
#> Computational engine: ranger
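To see how the unified argument names map onto a particular engine, parsnip provides translate(). The snippet below is a minimal sketch rather than part of the original walkthrough; the object names spec and translated are chosen here for illustration only.

```r
library(parsnip)

# Build the same specification as above, using parsnip's unified names.
spec <- rand_forest(mtry = 10, trees = 2000) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("regression")

# translate() fills in the engine-specific call template, so the printed
# result shows the argument names that ranger itself will receive.
translated <- translate(spec)
print(translated)
```

Printing the translated specification should show a fit template for ranger::ranger() in which trees appears under ranger's own argument name, num.trees.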

The engine can be easily changed. To use Spark, the change is straightforward:

rand_forest(mtry = 10, trees = 2000) |>
  set_engine("spark") |>
  set_mode("regression")
#> Random Forest Model Specification (regression)
#>
#> Main Arguments:
#>   mtry = 10
#>   trees = 2000
#>
#> Computational engine: spark
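Before switching engines, it can help to check which ones are registered for a model type. A short sketch using show_engines() follows; the variable name engines is illustrative, and the rows returned depend on the installed parsnip version and on any extension packages that are loaded.

```r
library(parsnip)

# List the engine/mode combinations currently registered for rand_forest().
engines <- show_engines("rand_forest")
print(engines)

# Keep only the engines that support regression.
regression_engines <- engines$engine[engines$mode == "regression"]
print(regression_engines)
```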

Either one of these model specifications can be fit in the same way:

set.seed(192)
rand_forest(mtry = 10, trees = 2000) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("regression") |>
  fit(mpg ~ ., data = mtcars)
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~10, x), num.trees = ~2000, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type: Regression
#> Number of trees: 2000
#> Sample size: 32
#> Number of independent variables: 10
#> Mtry: 10
#> Target node size: 5
#> Variable importance mode: impurity
#> Splitrule: variance
#> OOB prediction error (MSE): 5.976917
#> R squared (OOB): 0.8354559
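Once a specification has been fit, predictions follow the same pattern regardless of engine. The sketch below swaps in linear_reg() with the "lm" engine purely so that it runs without installing an extra engine package; this is a stand-in, not the random forest fitted above, and the names lm_fit and preds are illustrative.

```r
library(parsnip)

# Stand-in model: linear regression through stats::lm(). The fit() call has
# the same shape as for the random forest specification.
lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ ., data = mtcars)

# predict() on a parsnip fit takes `new_data` and returns a tibble with one
# row per input row; numeric predictions are in the `.pred` column.
preds <- predict(lm_fit, new_data = mtcars[1:3, ])
print(preds)
```

Replacing the first two lines of the pipeline with the rand_forest() specification and set_engine("ranger") should leave the fit() and predict() calls unchanged, provided ranger is installed.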

A list of all parsnip models across different CRAN packages can be found at https://www.tidymodels.org/find/parsnip/.

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.