Integrate - Spark | FLAML

FLAML has integrated Spark for distributed training. There are two main aspects of integration with Spark:

- Use Spark ML estimators for AutoML.
- Use Spark to run training in parallel Spark jobs.

Spark ML Estimators

FLAML integrates estimators based on Spark ML models. These models are trained in parallel using Spark, so we call them Spark estimators. To use these models, you first need to organize your data in the required format.

Data

For Spark estimators, AutoML only consumes Spark data. FLAML provides a convenient function to_pandas_on_spark in the flaml.automl.spark.utils module to convert your data into a pandas-on-spark (pyspark.pandas) dataframe/series, which Spark estimators require.

This utility function takes data in the form of a pandas.DataFrame or pyspark.sql.DataFrame and converts it into a pandas-on-spark dataframe. It also takes a pandas.Series or pyspark.sql.DataFrame and converts it into a pandas-on-spark series. If you pass in a pyspark.pandas.DataFrame, it is returned unchanged.

This function also accepts optional arguments index_col and default_index_type.

Here is an example code snippet for Spark Data:

import pandas as pd
from flaml.automl.spark.utils import to_pandas_on_spark

# Creating a dictionary
data = {
    "Square_Feet": [800, 1200, 1800, 1500, 850],
    "Age_Years": [20, 15, 10, 7, 25],
    "Price": [100000, 200000, 300000, 240000, 120000],
}

# Creating a pandas DataFrame
dataframe = pd.DataFrame(data)
label = "Price"

# Convert to pandas-on-spark dataframe
psdf = to_pandas_on_spark(dataframe)
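
As noted above, to_pandas_on_spark also accepts a pyspark.sql.DataFrame or a pandas.Series. Here is a minimal sketch of those conversions, assuming a running Spark session and reusing the dataframe created above:

from pyspark.sql import SparkSession
from flaml.automl.spark.utils import to_pandas_on_spark

spark = SparkSession.builder.getOrCreate()

# pyspark.sql.DataFrame -> pandas-on-spark dataframe
sdf = spark.createDataFrame(dataframe)
psdf_from_sdf = to_pandas_on_spark(sdf)

# pandas.Series -> pandas-on-spark series
ps_price = to_pandas_on_spark(dataframe["Price"])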

To use Spark ML models, you need to format your data appropriately. Specifically, use VectorAssembler to merge all feature columns into a single vector column.

Here is an example of how to use it:

from pyspark.ml.feature import VectorAssembler

columns = psdf.columns
feature_cols = [col for col in columns if col != label]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
# VectorAssembler operates on Spark DataFrames, so convert, transform,
# and keep the assembled "features" column together with the label column.
psdf = featurizer.transform(psdf.to_spark(index_col="index"))[
    "index", "features", label
]
# Convert back to pandas-on-spark for FLAML.
psdf = to_pandas_on_spark(psdf, index_col="index")

Later, when conducting the experiment, use your pandas-on-spark data just like non-Spark data, passing it via X_train, y_train or dataframe, label.
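
If you prefer the X_train, y_train interface, you can split the pandas-on-spark dataframe yourself. A minimal sketch, reusing psdf and label from above (the split itself is illustrative, not a FLAML requirement):

# Separate features and label from the pandas-on-spark dataframe
X_train = psdf.drop(columns=[label])
y_train = psdf[label]

# These can then be passed as:
# automl.fit(X_train=X_train, y_train=y_train, task="regression", ...)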

Estimators

Model List

- lgbm_spark: The class for fine-tuning Spark version LightGBM models, using the SynapseML API.

Usage

First, prepare your data in the required format as described in the previous section.

By including the models you intend to try in the estimator_list argument to flaml.automl, FLAML will start trying configurations for these models. If your input is Spark data, FLAML will also use estimators with the _spark postfix by default, even if you haven't specified them.

Here is an example code snippet using SparkML models in AutoML:

import flaml

# prepare your data in pandas-on-spark format as we previously mentioned

automl = flaml.AutoML()
settings = {
    "time_budget": 30,
    "metric": "r2",
    "estimator_list": ["lgbm_spark"],  # this setting is optional
    "task": "regression",
}

automl.fit(
    dataframe=psdf,
    label=label,
    **settings,
)
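
After the search finishes, you can inspect and reuse the result as with any FLAML run. A minimal sketch (these are standard FLAML AutoML attributes; the prediction input reuses psdf from above):

print(automl.best_estimator)  # e.g., "lgbm_spark"
print(automl.best_config)     # best hyperparameter configuration found

# Predict on pandas-on-spark data with the best model found
predictions = automl.predict(psdf)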

Link to notebook | Open in colab

Parallel Spark Jobs

You can activate Spark as the parallel backend during parallel tuning in both AutoML and Hyperparameter Tuning by setting use_spark to True. FLAML will dispatch your job to the distributed Spark backend using joblib-spark.

Please note that you should not set use_spark to True when applying AutoML and Tuning to Spark data. This is because only SparkML models will be used for Spark data in AutoML and Tuning. Since SparkML models already run in parallel, there is no need to distribute them with use_spark again.
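
The same flag also applies to Hyperparameter Tuning with flaml.tune. A minimal sketch with a toy objective (the objective function and search space are illustrative, not from this doc):

from flaml import tune

# Toy objective: minimize (x - 3)^2
def evaluate_config(config):
    return {"score": (config["x"] - 3) ** 2}

analysis = tune.run(
    evaluate_config,
    config={"x": tune.uniform(-10, 10)},
    metric="score",
    mode="min",
    num_samples=20,
    use_spark=True,  # dispatch trials to Spark executors via joblib-spark
)
print(analysis.best_config)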

All the Spark-related arguments are listed below. These arguments are available in both Hyperparameter Tuning and AutoML:

- use_spark: boolean, default=False. Whether to use Spark to run the training in parallel Spark jobs. This can accelerate training on large models and large datasets, but will incur more overhead in time and thus slow down training in some cases. GPU training is not supported yet when use_spark is True. For Spark clusters, by default, one trial is launched per executor; the environment variable FLAML_MAX_CONCURRENT can be set to override the detected number of executors. The final number of concurrent trials is the minimum of n_concurrent_trials and the number of executors.
- n_concurrent_trials: int, default=1. The number of concurrent trials. When n_concurrent_trials > 1, FLAML performs parallel tuning.
- force_cancel: boolean, default=False. Whether to forcibly cancel Spark jobs if the search time exceeds the time budget. Spark jobs include parallel tuning jobs and Spark-based model training jobs.

An example code snippet for using parallel Spark jobs:

import flaml

automl_experiment = flaml.AutoML()
automl_settings = {
    "time_budget": 30,
    "metric": "r2",
    "task": "regression",
    "n_concurrent_trials": 2,
    "use_spark": True,
    "force_cancel": True,  # Activating the force_cancel option can immediately halt Spark jobs once they exceed the allocated time_budget.
}

automl_experiment.fit(
    dataframe=dataframe,
    label=label,
    **automl_settings,
)

Link to notebook | Open in colab