BisectingKMeans (Spark 3.5.5 JavaDoc)
Object
- org.apache.spark.ml.PipelineStage
- org.apache.spark.ml.Estimator<BisectingKMeansModel>
- org.apache.spark.ml.clustering.BisectingKMeans
All Implemented Interfaces:
java.io.Serializable, org.apache.spark.internal.Logging, BisectingKMeansParams, Params, HasDistanceMeasure, HasFeaturesCol, HasMaxIter, HasPredictionCol, HasSeed, HasWeightCol, DefaultParamsWritable, Identifiable, MLWritable
public class BisectingKMeans
extends Estimator<BisectingKMeansModel>
implements BisectingKMeansParams, DefaultParamsWritable
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are `k` leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than `k` leaf clusters, larger clusters get higher priority.
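The priority rule in the description above can be sketched in plain Java. This is only an illustration of the stated behavior, not Spark's implementation: the class `BisectLevelSketch`, the method `chooseClustersToBisect`, and the representation of clusters by their sizes are all hypothetical. It relies on the fact that bisecting one leaf replaces it with two, so each bisection adds exactly one leaf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BisectLevelSketch {

    /**
     * Picks which divisible leaf clusters to bisect on the current level.
     * Hypothetical helper: clusters are represented only by their sizes.
     *
     * @param divisibleSizes sizes of the divisible leaf clusters on the bottom level
     * @param currentLeaves  number of leaf clusters before this level is bisected
     * @param k              desired number of leaf clusters
     * @return sizes of the clusters chosen for bisection, largest first
     */
    public static List<Integer> chooseClustersToBisect(
            List<Integer> divisibleSizes, int currentLeaves, int k) {
        // Each bisection adds one leaf, so at most (k - currentLeaves) are allowed.
        int stillNeeded = Math.max(0, k - currentLeaves);
        List<Integer> sorted = new ArrayList<>(divisibleSizes);
        sorted.sort(Comparator.reverseOrder());
        // If bisecting everything would overshoot k, keep only the largest clusters.
        int take = Math.min(stillNeeded, sorted.size());
        return new ArrayList<>(sorted.subList(0, take));
    }

    public static void main(String[] args) {
        // Four leaves exist and k = 6, so only two more leaves are needed:
        // the two largest divisible clusters (40 and 25) are chosen.
        System.out.println(chooseClustersToBisect(Arrays.asList(40, 10, 25, 5), 4, 6));
    }
}
```

When the number of divisible clusters does not exceed the number of leaves still needed, the sketch returns all of them, which corresponds to bisecting the whole level at once.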
See Also:
Steinbach, Karypis, and Kumar, "A comparison of document clustering techniques", KDD Workshop on Text Mining, 2000; Serialized Form
Nested Class Summary
* ### Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
  `org.apache.spark.internal.Logging.SparkShellLoggingFilter`
Constructor Summary
Constructors
| Constructor and Description |
| --- |
| `BisectingKMeans()` |
| `BisectingKMeans(String uid)` |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `BisectingKMeans` | `copy(ParamMap extra)` Creates a copy of this instance with the same UID and some extra params. |
| `Param<String>` | `distanceMeasure()` Param for the distance measure. |
| `Param<String>` | `featuresCol()` Param for features column name. |
| `BisectingKMeansModel` | `fit(Dataset<?> dataset)` Fits a model to the input data. |
| `IntParam` | `k()` The desired number of leaf clusters. |
| `static BisectingKMeans` | `load(String path)` |
| `IntParam` | `maxIter()` Param for maximum number of iterations (>= 0). |
| `DoubleParam` | `minDivisibleClusterSize()` The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0). |
| `Param<String>` | `predictionCol()` Param for prediction column name. |
| `static MLReader<T>` | `read()` |
| `LongParam` | `seed()` Param for random seed. |
| `BisectingKMeans` | `setDistanceMeasure(String value)` |
| `BisectingKMeans` | `setFeaturesCol(String value)` |
| `BisectingKMeans` | `setK(int value)` |
| `BisectingKMeans` | `setMaxIter(int value)` |
| `BisectingKMeans` | `setMinDivisibleClusterSize(double value)` |
| `BisectingKMeans` | `setPredictionCol(String value)` |
| `BisectingKMeans` | `setSeed(long value)` |
| `BisectingKMeans` | `setWeightCol(String value)` Sets the value of param weightCol. |
| `StructType` | `transformSchema(StructType schema)` Check transform validity and derive the output schema from the input schema. |
| `String` | `uid()` An immutable unique ID for the object and its derivatives. |
| `Param<String>` | `weightCol()` Param for weight column name. |

* ### Methods inherited from class org.apache.spark.ml.[Estimator](../../../../../org/apache/spark/ml/Estimator.html "class in org.apache.spark.ml")
  `[fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-org.apache.spark.ml.param.ParamMap-), [fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-org.apache.spark.ml.param.ParamPair-org.apache.spark.ml.param.ParamPair...-), [fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-org.apache.spark.ml.param.ParamPair-scala.collection.Seq-), [fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-scala.collection.Seq-)`
* ### Methods inherited from class org.apache.spark.ml.[PipelineStage](../../../../../org/apache/spark/ml/PipelineStage.html "class in org.apache.spark.ml")
  `[params](../../../../../org/apache/spark/ml/PipelineStage.html#params--)`
* ### Methods inherited from class Object
  `equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`
* ### Methods inherited from interface org.apache.spark.ml.clustering.[BisectingKMeansParams](../../../../../org/apache/spark/ml/clustering/BisectingKMeansParams.html "interface in org.apache.spark.ml.clustering")
  `[getK](../../../../../org/apache/spark/ml/clustering/BisectingKMeansParams.html#getK--), [getMinDivisibleClusterSize](../../../../../org/apache/spark/ml/clustering/BisectingKMeansParams.html#getMinDivisibleClusterSize--), [validateAndTransformSchema](../../../../../org/apache/spark/ml/clustering/BisectingKMeansParams.html#validateAndTransformSchema-org.apache.spark.sql.types.StructType-)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasMaxIter](../../../../../org/apache/spark/ml/param/shared/HasMaxIter.html "interface in org.apache.spark.ml.param.shared")
  `[getMaxIter](../../../../../org/apache/spark/ml/param/shared/HasMaxIter.html#getMaxIter--)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasFeaturesCol](../../../../../org/apache/spark/ml/param/shared/HasFeaturesCol.html "interface in org.apache.spark.ml.param.shared")
  `[getFeaturesCol](../../../../../org/apache/spark/ml/param/shared/HasFeaturesCol.html#getFeaturesCol--)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasSeed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html "interface in org.apache.spark.ml.param.shared")
  `[getSeed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html#getSeed--)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasPredictionCol](../../../../../org/apache/spark/ml/param/shared/HasPredictionCol.html "interface in org.apache.spark.ml.param.shared")
  `[getPredictionCol](../../../../../org/apache/spark/ml/param/shared/HasPredictionCol.html#getPredictionCol--)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasDistanceMeasure](../../../../../org/apache/spark/ml/param/shared/HasDistanceMeasure.html "interface in org.apache.spark.ml.param.shared")
  `[getDistanceMeasure](../../../../../org/apache/spark/ml/param/shared/HasDistanceMeasure.html#getDistanceMeasure--)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasWeightCol](../../../../../org/apache/spark/ml/param/shared/HasWeightCol.html "interface in org.apache.spark.ml.param.shared")
  `[getWeightCol](../../../../../org/apache/spark/ml/param/shared/HasWeightCol.html#getWeightCol--)`
* ### Methods inherited from interface org.apache.spark.ml.param.[Params](../../../../../org/apache/spark/ml/param/Params.html "interface in org.apache.spark.ml.param")
  `[clear](../../../../../org/apache/spark/ml/param/Params.html#clear-org.apache.spark.ml.param.Param-), [copyValues](../../../../../org/apache/spark/ml/param/Params.html#copyValues-T-org.apache.spark.ml.param.ParamMap-), [defaultCopy](../../../../../org/apache/spark/ml/param/Params.html#defaultCopy-org.apache.spark.ml.param.ParamMap-), [defaultParamMap](../../../../../org/apache/spark/ml/param/Params.html#defaultParamMap--), [explainParam](../../../../../org/apache/spark/ml/param/Params.html#explainParam-org.apache.spark.ml.param.Param-), [explainParams](../../../../../org/apache/spark/ml/param/Params.html#explainParams--), [extractParamMap](../../../../../org/apache/spark/ml/param/Params.html#extractParamMap--), [extractParamMap](../../../../../org/apache/spark/ml/param/Params.html#extractParamMap-org.apache.spark.ml.param.ParamMap-), [get](../../../../../org/apache/spark/ml/param/Params.html#get-org.apache.spark.ml.param.Param-), [getDefault](../../../../../org/apache/spark/ml/param/Params.html#getDefault-org.apache.spark.ml.param.Param-), [getOrDefault](../../../../../org/apache/spark/ml/param/Params.html#getOrDefault-org.apache.spark.ml.param.Param-), [getParam](../../../../../org/apache/spark/ml/param/Params.html#getParam-java.lang.String-), [hasDefault](../../../../../org/apache/spark/ml/param/Params.html#hasDefault-org.apache.spark.ml.param.Param-), [hasParam](../../../../../org/apache/spark/ml/param/Params.html#hasParam-java.lang.String-), [isDefined](../../../../../org/apache/spark/ml/param/Params.html#isDefined-org.apache.spark.ml.param.Param-), [isSet](../../../../../org/apache/spark/ml/param/Params.html#isSet-org.apache.spark.ml.param.Param-), [onParamChange](../../../../../org/apache/spark/ml/param/Params.html#onParamChange-org.apache.spark.ml.param.Param-), [paramMap](../../../../../org/apache/spark/ml/param/Params.html#paramMap--), [params](../../../../../org/apache/spark/ml/param/Params.html#params--), [set](../../../../../org/apache/spark/ml/param/Params.html#set-org.apache.spark.ml.param.Param-T-), [set](../../../../../org/apache/spark/ml/param/Params.html#set-org.apache.spark.ml.param.ParamPair-), [set](../../../../../org/apache/spark/ml/param/Params.html#set-java.lang.String-java.lang.Object-), [setDefault](../../../../../org/apache/spark/ml/param/Params.html#setDefault-org.apache.spark.ml.param.Param-T-), [setDefault](../../../../../org/apache/spark/ml/param/Params.html#setDefault-scala.collection.Seq-), [shouldOwn](../../../../../org/apache/spark/ml/param/Params.html#shouldOwn-org.apache.spark.ml.param.Param-)`
* ### Methods inherited from interface org.apache.spark.ml.util.[Identifiable](../../../../../org/apache/spark/ml/util/Identifiable.html "interface in org.apache.spark.ml.util")
  `[toString](../../../../../org/apache/spark/ml/util/Identifiable.html#toString--)`
* ### Methods inherited from interface org.apache.spark.ml.util.[DefaultParamsWritable](../../../../../org/apache/spark/ml/util/DefaultParamsWritable.html "interface in org.apache.spark.ml.util")
  `[write](../../../../../org/apache/spark/ml/util/DefaultParamsWritable.html#write--)`
* ### Methods inherited from interface org.apache.spark.ml.util.[MLWritable](../../../../../org/apache/spark/ml/util/MLWritable.html "interface in org.apache.spark.ml.util")
  `[save](../../../../../org/apache/spark/ml/util/MLWritable.html#save-java.lang.String-)`
* ### Methods inherited from interface org.apache.spark.internal.Logging
  `$init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize`
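The `minDivisibleClusterSize` rule in the method summary (an absolute point count when the value is at least 1.0, a proportion of all points otherwise) can be sketched in plain Java. This is an illustration only, not Spark's code: the class `MinDivisibleSizeSketch`, the method `minPoints`, and the choice to round up to a whole number of points are assumptions made for the example.

```java
public class MinDivisibleSizeSketch {

    /**
     * Converts a minDivisibleClusterSize value into a minimum point count.
     * Hypothetical helper; rounding up is an assumption of this sketch.
     *
     * @param minDivisibleClusterSize the parameter value (must be positive)
     * @param totalPoints             number of points in the whole dataset
     * @return smallest cluster size that would still count as divisible
     */
    public static long minPoints(double minDivisibleClusterSize, long totalPoints) {
        if (minDivisibleClusterSize <= 0.0) {
            throw new IllegalArgumentException("minDivisibleClusterSize must be positive");
        }
        if (minDivisibleClusterSize >= 1.0) {
            // Values >= 1.0 are read as an absolute number of points.
            return (long) Math.ceil(minDivisibleClusterSize);
        }
        // Values < 1.0 are read as a proportion of all points.
        return (long) Math.ceil(minDivisibleClusterSize * totalPoints);
    }

    public static void main(String[] args) {
        System.out.println(minPoints(1.0, 1000));   // default value: 1 point
        System.out.println(minPoints(50.0, 1000));  // absolute count: 50 points
        System.out.println(minPoints(0.25, 1000));  // proportion: 250 points
    }
}
```

Under these assumptions, the default of 1.0 means any cluster with at least one point meets the size requirement.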
Constructor Detail
* #### BisectingKMeans
  public BisectingKMeans(String uid)
* #### BisectingKMeans
  public BisectingKMeans()
Method Detail
* #### load
  public static [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") load(String path)
* #### read
  public static [MLReader](../../../../../org/apache/spark/ml/util/MLReader.html "class in org.apache.spark.ml.util")<T> read()
* #### k
  public final [IntParam](../../../../../org/apache/spark/ml/param/IntParam.html "class in org.apache.spark.ml.param") k()
  The desired number of leaf clusters. Must be > 1. Default: 4. The actual number could be smaller if there are no divisible leaf clusters.
  Specified by: `[k](../../../../../org/apache/spark/ml/clustering/BisectingKMeansParams.html#k--)` in interface `[BisectingKMeansParams](../../../../../org/apache/spark/ml/clustering/BisectingKMeansParams.html "interface in org.apache.spark.ml.clustering")`
  Returns: (undocumented)
* #### minDivisibleClusterSize
  public final [DoubleParam](../../../../../org/apache/spark/ml/param/DoubleParam.html "class in org.apache.spark.ml.param") minDivisibleClusterSize()
  The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0).
  Specified by: `[minDivisibleClusterSize](../../../../../org/apache/spark/ml/clustering/BisectingKMeansParams.html#minDivisibleClusterSize--)` in interface `[BisectingKMeansParams](../../../../../org/apache/spark/ml/clustering/BisectingKMeansParams.html "interface in org.apache.spark.ml.clustering")`
  Returns: (undocumented)
* #### weightCol
  public final [Param](../../../../../org/apache/spark/ml/param/Param.html "class in org.apache.spark.ml.param")<String> weightCol()
  Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
  Specified by: `[weightCol](../../../../../org/apache/spark/ml/param/shared/HasWeightCol.html#weightCol--)` in interface `[HasWeightCol](../../../../../org/apache/spark/ml/param/shared/HasWeightCol.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)
* #### distanceMeasure
  public final [Param](../../../../../org/apache/spark/ml/param/Param.html "class in org.apache.spark.ml.param")<String> distanceMeasure()
  Param for the distance measure. Supported options: 'euclidean' and 'cosine'.
  Specified by: `[distanceMeasure](../../../../../org/apache/spark/ml/param/shared/HasDistanceMeasure.html#distanceMeasure--)` in interface `[HasDistanceMeasure](../../../../../org/apache/spark/ml/param/shared/HasDistanceMeasure.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)
* #### predictionCol
  public final [Param](../../../../../org/apache/spark/ml/param/Param.html "class in org.apache.spark.ml.param")<String> predictionCol()
  Param for prediction column name.
  Specified by: `[predictionCol](../../../../../org/apache/spark/ml/param/shared/HasPredictionCol.html#predictionCol--)` in interface `[HasPredictionCol](../../../../../org/apache/spark/ml/param/shared/HasPredictionCol.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)
* #### seed
  public final [LongParam](../../../../../org/apache/spark/ml/param/LongParam.html "class in org.apache.spark.ml.param") seed()
  Description copied from interface: `[HasSeed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html#seed--)`
  Param for random seed.
  Specified by: `[seed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html#seed--)` in interface `[HasSeed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)
* #### featuresCol
  public final [Param](../../../../../org/apache/spark/ml/param/Param.html "class in org.apache.spark.ml.param")<String> featuresCol()
  Param for features column name.
  Specified by: `[featuresCol](../../../../../org/apache/spark/ml/param/shared/HasFeaturesCol.html#featuresCol--)` in interface `[HasFeaturesCol](../../../../../org/apache/spark/ml/param/shared/HasFeaturesCol.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)
* #### maxIter
  public final [IntParam](../../../../../org/apache/spark/ml/param/IntParam.html "class in org.apache.spark.ml.param") maxIter()
  Description copied from interface: `[HasMaxIter](../../../../../org/apache/spark/ml/param/shared/HasMaxIter.html#maxIter--)`
  Param for maximum number of iterations (>= 0).
  Specified by: `[maxIter](../../../../../org/apache/spark/ml/param/shared/HasMaxIter.html#maxIter--)` in interface `[HasMaxIter](../../../../../org/apache/spark/ml/param/shared/HasMaxIter.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)
* #### uid
  public String uid()
  An immutable unique ID for the object and its derivatives.
  Specified by: `[uid](../../../../../org/apache/spark/ml/util/Identifiable.html#uid--)` in interface `[Identifiable](../../../../../org/apache/spark/ml/util/Identifiable.html "interface in org.apache.spark.ml.util")`
  Returns: (undocumented)
* #### copy
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") copy([ParamMap](../../../../../org/apache/spark/ml/param/ParamMap.html "class in org.apache.spark.ml.param") extra)
  Description copied from interface: `[Params](../../../../../org/apache/spark/ml/param/Params.html#copy-org.apache.spark.ml.param.ParamMap-)`
  Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See `defaultCopy()`.
  Specified by: `[copy](../../../../../org/apache/spark/ml/param/Params.html#copy-org.apache.spark.ml.param.ParamMap-)` in interface `[Params](../../../../../org/apache/spark/ml/param/Params.html "interface in org.apache.spark.ml.param")`
  Specified by: `[copy](../../../../../org/apache/spark/ml/Estimator.html#copy-org.apache.spark.ml.param.ParamMap-)` in class `[Estimator](../../../../../org/apache/spark/ml/Estimator.html "class in org.apache.spark.ml")<[BisectingKMeansModel](../../../../../org/apache/spark/ml/clustering/BisectingKMeansModel.html "class in org.apache.spark.ml.clustering")>`
  Parameters: `extra` - (undocumented)
  Returns: (undocumented)
* #### setFeaturesCol
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") setFeaturesCol(String value)
* #### setPredictionCol
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") setPredictionCol(String value)
* #### setK
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") setK(int value)
* #### setMaxIter
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") setMaxIter(int value)
* #### setSeed
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") setSeed(long value)
* #### setMinDivisibleClusterSize
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") setMinDivisibleClusterSize(double value)
* #### setDistanceMeasure
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") setDistanceMeasure(String value)
* #### setWeightCol
  public [BisectingKMeans](../../../../../org/apache/spark/ml/clustering/BisectingKMeans.html "class in org.apache.spark.ml.clustering") setWeightCol(String value)
  Sets the value of param `weightCol`. If this is not set or empty, we treat all instance weights as 1.0. Default is not set, so all instances have weight one.
  Parameters: `value` - (undocumented)
  Returns: (undocumented)
* #### fit
  public [BisectingKMeansModel](../../../../../org/apache/spark/ml/clustering/BisectingKMeansModel.html "class in org.apache.spark.ml.clustering") fit([Dataset](../../../../../org/apache/spark/sql/Dataset.html "class in org.apache.spark.sql")<?> dataset)
  Description copied from class: `[Estimator](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-)`
  Fits a model to the input data.
  Specified by: `[fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-)` in class `[Estimator](../../../../../org/apache/spark/ml/Estimator.html "class in org.apache.spark.ml")<[BisectingKMeansModel](../../../../../org/apache/spark/ml/clustering/BisectingKMeansModel.html "class in org.apache.spark.ml.clustering")>`
  Parameters: `dataset` - (undocumented)
  Returns: (undocumented)
* #### transformSchema
  public [StructType](../../../../../org/apache/spark/sql/types/StructType.html "class in org.apache.spark.sql.types") transformSchema([StructType](../../../../../org/apache/spark/sql/types/StructType.html "class in org.apache.spark.sql.types") schema)
  Check transform validity and derive the output schema from the input schema. We check validity for interactions between parameters during `transformSchema` and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by `Param.validate()`. A typical implementation should first conduct verification on schema change and parameter validity, including complex parameter interaction checks.
  Specified by: `[transformSchema](../../../../../org/apache/spark/ml/PipelineStage.html#transformSchema-org.apache.spark.sql.types.StructType-)` in class `[PipelineStage](../../../../../org/apache/spark/ml/PipelineStage.html "class in org.apache.spark.ml")`
  Parameters: `schema` - (undocumented)
  Returns: (undocumented)
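
For the `distanceMeasure` param documented above, the two supported options, 'euclidean' and 'cosine', can be illustrated with their textbook definitions. The sketch below is plain Java and does not call Spark: the class `DistanceMeasureSketch` and its methods are hypothetical, and cosine distance is taken here as one minus cosine similarity, which is an assumption about what the option means rather than a copy of Spark's code.

```java
public class DistanceMeasureSketch {

    /** Euclidean distance: square root of the summed squared differences. */
    public static double euclidean(double[] a, double[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Cosine distance, taken here as 1 - cosine similarity.
     * The result is NaN if either vector has zero length (norm 0).
     */
    public static double cosine(double[] a, double[] b) {
        requireSameLength(a, b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static void requireSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 0.0};
        double[] y = {2.0, 0.0};
        // Same direction, different magnitude: the Euclidean distance is
        // positive while the cosine distance is zero.
        System.out.println(euclidean(x, y));
        System.out.println(cosine(x, y));
    }
}
```

The example in `main` shows the practical difference: cosine distance ignores vector magnitude, so it may suit features such as term-frequency vectors where direction matters more than scale.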