BisectingKMeans (Spark 3.5.5 JavaDoc)

Object
- org.apache.spark.mllib.clustering.BisectingKMeans
All Implemented Interfaces:
org.apache.spark.internal.Logging

public class BisectingKMeans
extends Object
implements org.apache.spark.internal.Logging
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively, it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
param: k the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
param: maxIterations the max number of k-means iterations to split clusters (default: 20)
param: minDivisibleClusterSize the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1)
param: seed a random seed (default: hash value of the class name)
See Also:
Steinbach, Karypis, and Kumar, A comparison of document clustering techniques, KDD Workshop on Text Mining, 2000.
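
The level-by-level selection described above can be illustrated with a small, self-contained sketch. It is not Spark's implementation: the class `SplitPlanner` and the method `chooseClustersToSplit` are hypothetical names, and a cluster is treated as divisible purely by its size, assuming `minSize` has already been resolved from `minDivisibleClusterSize`. Since each bisection replaces one leaf with two, at most `k` minus the current leaf count splits fit on a level, and the largest divisible clusters are chosen first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public final class SplitPlanner {
    private SplitPlanner() {}

    /**
     * Returns the indices of the leaf clusters to bisect on the current level.
     * Hypothetical helper: divisibility is decided by size alone (an assumption).
     */
    public static List<Integer> chooseClustersToSplit(long[] leafSizes, long minSize, int k) {
        // Each bisection replaces one leaf with two, so it adds exactly one leaf.
        int splitsAllowed = k - leafSizes.length;
        if (splitsAllowed <= 0) {
            return new ArrayList<>();
        }
        List<Integer> divisible = new ArrayList<>();
        for (int i = 0; i < leafSizes.length; i++) {
            if (leafSizes[i] >= minSize) {
                divisible.add(i);
            }
        }
        if (divisible.size() > splitsAllowed) {
            // Not every divisible cluster can be split: larger clusters go first.
            divisible.sort(Comparator.comparingLong((Integer i) -> leafSizes[i]).reversed());
            return new ArrayList<>(divisible.subList(0, splitsAllowed));
        }
        return divisible;
    }
}
```

Under these assumptions, leaf sizes `{50, 10, 30, 1}` with `minSize = 2` and `k = 6` leave room for two splits, so the sketch picks the clusters at indices 0 and 2.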

Nested Class Summary

 * ### Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging  
 `org.apache.spark.internal.Logging.SparkShellLoggingFilter`

Constructor Summary

Constructors

Constructor and Description
BisectingKMeans() Constructs with the default configuration

Method Summary

Modifier and Type	Method and Description
String	getDistanceMeasure() Gets the distance measure used by the algorithm.
int	getK() Gets the desired number of leaf clusters.
int	getMaxIterations() Gets the max number of k-means iterations to split clusters.
double	getMinDivisibleClusterSize() Gets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster.
long	getSeed() Gets the random seed.
BisectingKMeansModel	run(JavaRDD<Vector> data) Java-friendly version of run().
BisectingKMeansModel	run(RDD<Vector> input) Runs the bisecting k-means algorithm.
BisectingKMeans	setDistanceMeasure(String distanceMeasure) Sets the distance measure used by the algorithm.
BisectingKMeans	setK(int k) Sets the desired number of leaf clusters (default: 4).
BisectingKMeans	setMaxIterations(int maxIterations) Sets the max number of k-means iterations to split clusters (default: 20).
BisectingKMeans	setMinDivisibleClusterSize(double minDivisibleClusterSize) Sets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1).
BisectingKMeans	setSeed(long seed) Sets the random seed (default: hash value of the class name).
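
The two readings of `minDivisibleClusterSize` in the table above (a point count when the value is at least 1.0, a proportion of all points otherwise) can be made concrete with a short sketch. The class `MinDivisibleSize` and the method `resolve` are hypothetical names, not part of the Spark API, and rounding fractional results up is an assumption that the description does not state.

```java
public final class MinDivisibleSize {
    private MinDivisibleSize() {}

    /**
     * Converts a minDivisibleClusterSize setting into a minimum point count.
     * Hypothetical helper; rounding up is an assumption of this sketch.
     */
    public static long resolve(double minDivisibleClusterSize, long totalPoints) {
        if (minDivisibleClusterSize <= 0.0) {
            throw new IllegalArgumentException("minDivisibleClusterSize must be positive");
        }
        if (minDivisibleClusterSize >= 1.0) {
            // Values of 1.0 or more are read as an absolute number of points.
            return (long) Math.ceil(minDivisibleClusterSize);
        }
        // Values below 1.0 are read as a proportion of all input points.
        return (long) Math.ceil(minDivisibleClusterSize * totalPoints);
    }
}
```

With 1000 input points, for instance, a setting of `25.0` resolves to 25 points and a setting of `0.25` resolves to 250 points in this sketch.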

   * ### Methods inherited from class Object  
   `equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`  
   * ### Methods inherited from interface org.apache.spark.internal.Logging  
   `$init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize`

Constructor Detail

 * #### BisectingKMeans  
 public BisectingKMeans()  
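 Constructs a `BisectingKMeans` instance with the default configuration (k: 4, maxIterations: 20, minDivisibleClusterSize: 1, seed: hash value of the class name).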

Method Detail

* #### setK  
public [BisectingKMeans](../../../../../org/apache/spark/mllib/clustering/BisectingKMeans.html "class in org.apache.spark.mllib.clustering") setK(int k)  
Sets the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.  
Parameters:  
`k` \- the desired number of leaf clusters  
Returns:  
this `BisectingKMeans` instance, so that setter calls can be chained  
* #### getK  
public int getK()  
Gets the desired number of leaf clusters.  
Returns:  
the desired number of leaf clusters  
* #### setMaxIterations  
public [BisectingKMeans](../../../../../org/apache/spark/mllib/clustering/BisectingKMeans.html "class in org.apache.spark.mllib.clustering") setMaxIterations(int maxIterations)  
Sets the max number of k-means iterations to split clusters (default: 20).  
Parameters:  
`maxIterations` \- the max number of k-means iterations to split clusters  
Returns:  
this `BisectingKMeans` instance, so that setter calls can be chained  
* #### getMaxIterations  
public int getMaxIterations()  
Gets the max number of k-means iterations to split clusters.  
Returns:  
the max number of k-means iterations to split clusters  
* #### setMinDivisibleClusterSize  
public [BisectingKMeans](../../../../../org/apache/spark/mllib/clustering/BisectingKMeans.html "class in org.apache.spark.mllib.clustering") setMinDivisibleClusterSize(double minDivisibleClusterSize)  
Sets the minimum number of points (if greater than or equal to `1.0`) or the minimum proportion of points (if less than `1.0`) of a divisible cluster (default: 1).  
Parameters:  
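`minDivisibleClusterSize` \- the minimum number of points (if greater than or equal to `1.0`) or the minimum proportion of points (if less than `1.0`) of a divisible cluster  
Returns:  
this `BisectingKMeans` instance, so that setter calls can be chained  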
* #### getMinDivisibleClusterSize  
public double getMinDivisibleClusterSize()  
Gets the minimum number of points (if greater than or equal to `1.0`) or the minimum proportion of points (if less than `1.0`) of a divisible cluster.  
Returns:  
the minimum number or proportion of points of a divisible cluster  
* #### setSeed  
public [BisectingKMeans](../../../../../org/apache/spark/mllib/clustering/BisectingKMeans.html "class in org.apache.spark.mllib.clustering") setSeed(long seed)  
Sets the random seed (default: hash value of the class name).  
Parameters:  
`seed` \- the random seed  
Returns:  
this `BisectingKMeans` instance, so that setter calls can be chained  
* #### getSeed  
public long getSeed()  
Gets the random seed.  
Returns:  
the random seed  
* #### getDistanceMeasure  
public String getDistanceMeasure()  
Gets the distance measure used by the algorithm.  
Returns:  
the name of the distance measure  
* #### setDistanceMeasure  
public [BisectingKMeans](../../../../../org/apache/spark/mllib/clustering/BisectingKMeans.html "class in org.apache.spark.mllib.clustering") setDistanceMeasure(String distanceMeasure)  
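Sets the distance measure used by the algorithm.  
Parameters:  
`distanceMeasure` \- the name of the distance measure; Spark's supported values are `"euclidean"` and `"cosine"`  
Returns:  
this `BisectingKMeans` instance, so that setter calls can be chained  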
* #### run  
public [BisectingKMeansModel](../../../../../org/apache/spark/mllib/clustering/BisectingKMeansModel.html "class in org.apache.spark.mllib.clustering") run([RDD](../../../../../org/apache/spark/rdd/RDD.html "class in org.apache.spark.rdd")<[Vector](../../../../../org/apache/spark/mllib/linalg/Vector.html "interface in org.apache.spark.mllib.linalg")> input)  
Runs the bisecting k-means algorithm.  
Parameters:  
`input` \- RDD of vectors  
Returns:  
model for the bisecting kmeans  
* #### run  
public [BisectingKMeansModel](../../../../../org/apache/spark/mllib/clustering/BisectingKMeansModel.html "class in org.apache.spark.mllib.clustering") run([JavaRDD](../../../../../org/apache/spark/api/java/JavaRDD.html "class in org.apache.spark.api.java")<[Vector](../../../../../org/apache/spark/mllib/linalg/Vector.html "interface in org.apache.spark.mllib.linalg")> data)  
Java-friendly version of `run()`.  
Parameters:  
`data` \- RDD of vectors, as a `JavaRDD`  
Returns:  
model for the bisecting kmeans
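
The setters and the Java-friendly `run` overload documented above might be combined as in the following minimal sketch. It assumes the Spark core and MLlib artifacts are on the classpath and uses a local master. The class name `BisectingKMeansSketch`, the helper `configure`, the sample points, and the chosen parameter values are illustrative assumptions, not recommendations.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.BisectingKMeans;
import org.apache.spark.mllib.clustering.BisectingKMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public final class BisectingKMeansSketch {
    private BisectingKMeansSketch() {}

    /** Builds a configured instance; each setter returns the instance, so calls chain. */
    public static BisectingKMeans configure() {
        return new BisectingKMeans()
            .setK(2)                          // desired number of leaf clusters
            .setMaxIterations(20)             // k-means iterations per split
            .setMinDivisibleClusterSize(1.0)  // at least one point per divisible cluster
            .setSeed(42L);                    // fixed seed (illustrative value)
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("BisectingKMeansSketch")
            .setMaster("local[2]");           // local master, assumed for this sketch
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // Two well-separated groups of illustrative 2-D points.
            List<Vector> points = Arrays.asList(
                Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
                Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3));
            JavaRDD<Vector> data = sc.parallelize(points, 2);

            // Java-friendly overload: run(JavaRDD<Vector>).
            BisectingKMeansModel model = configure().run(data);

            Vector[] centers = model.clusterCenters();
            for (int i = 0; i < centers.length; i++) {
                System.out.println("Cluster center " + i + ": " + centers[i]);
            }
            System.out.println("Compute cost: " + model.computeCost(data));
        } finally {
            sc.stop();
        }
    }
}
```

Keeping the configuration in a separate method makes it possible to inspect the settings through the getters (`getK()`, `getSeed()`, and so on) without starting a Spark context.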
