OneHotEncoder (Spark 3.5.5 JavaDoc)


public class OneHotEncoder
extends Estimator<OneHotEncoderModel>
implements OneHotEncoderBase, DefaultParamsWritable
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example, with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because including it would make the vector entries sum to one and hence be linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
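The mapping described above can be illustrated with a minimal plain-Java sketch. It does not use Spark: the class OneHotSketch and its encode method are hypothetical names chosen for illustration, and the result is a dense double[] rather than the vector type Spark produces.

```java
import java.util.Arrays;

public class OneHotSketch {
    // Hypothetical helper, not part of Spark: encodes one category index
    // as a dense array, optionally dropping the slot of the last category.
    public static double[] encode(int index, int numCategories, boolean dropLast) {
        if (index < 0 || index >= numCategories) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        int size = dropLast ? numCategories - 1 : numCategories;
        double[] vec = new double[size]; // Java initializes every entry to 0.0
        if (index < size) {
            vec[index] = 1.0; // the last category has no slot when dropLast is true
        }
        return vec;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(2, 5, true)));  // [0.0, 0.0, 1.0, 0.0]
        System.out.println(Arrays.toString(encode(4, 5, true)));  // [0.0, 0.0, 0.0, 0.0]
        System.out.println(Arrays.toString(encode(4, 5, false))); // [0.0, 0.0, 0.0, 0.0, 1.0]
    }
}
```

With dropLast set to false the vector has one slot per category, so the last category is no longer encoded as all zeros.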
See Also:
StringIndexer for converting categorical values into category indices, Serialized Form
Note:
This is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.
When handleInvalid is configured to 'keep', an extra "category" indicating invalid values is added as the last category. So when dropLast is true, invalid values are encoded as an all-zeros vector.
When encoding multiple columns by using the inputCols and outputCols params, input and output columns come in pairs, specified by their order in the arrays, and each pair is treated independently.
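The interaction between handleInvalid set to 'keep' and dropLast can be sketched in plain Java as follows. This is an illustration of the behavior described in the note, not Spark's implementation: KeepInvalidSketch and encode are hypothetical names, the output is a dense array rather than a sparse vector, and the sketch assumes that any index outside [0, numCategories) counts as invalid.

```java
import java.util.Arrays;

public class KeepInvalidSketch {
    // Hypothetical helper, not part of Spark. Assumes an index outside
    // [0, numCategories) is an invalid value.
    public static double[] encode(int index, int numCategories,
                                  boolean dropLast, boolean keepInvalid) {
        boolean valid = index >= 0 && index < numCategories;
        if (!valid && !keepInvalid) {
            throw new IllegalArgumentException("invalid category index: " + index);
        }
        // 'keep' appends one extra slot for invalid values after the real categories.
        int slot = valid ? index : numCategories;
        // dropLast then removes whichever slot is last.
        int size = numCategories + (keepInvalid ? 1 : 0) - (dropLast ? 1 : 0);
        double[] vec = new double[size];
        if (slot < size) {
            vec[slot] = 1.0;
        }
        return vec;
    }

    public static void main(String[] args) {
        // Invalid value with dropLast true: the extra slot is dropped, so all zeros.
        System.out.println(Arrays.toString(encode(7, 5, true, true)));
        // Same value with dropLast false: the extra sixth slot is set.
        System.out.println(Arrays.toString(encode(7, 5, false, true)));
        // Last real category with 'keep' and dropLast true: it keeps its own slot.
        System.out.println(Arrays.toString(encode(4, 5, true, true)));
    }
}
```

Under these assumptions, with 'keep' and dropLast both enabled the all-zeros vector stands for invalid values, while the last real category keeps its own slot.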