BucketedRandomProjectionLSH (Spark 3.5.5 JavaDoc)
Object
- org.apache.spark.ml.PipelineStage
- org.apache.spark.ml.Estimator
- org.apache.spark.ml.feature.BucketedRandomProjectionLSH
All Implemented Interfaces:
java.io.Serializable, org.apache.spark.internal.Logging, BucketedRandomProjectionLSHParams, LSHParams, Params, HasInputCol, HasOutputCol, HasSeed, DefaultParamsWritable, Identifiable, MLWritable
public class BucketedRandomProjectionLSH
extends Estimator
implements BucketedRandomProjectionLSHParams, HasSeed
This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.
The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.
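The description above can be illustrated with a minimal plain-Java sketch of a bucketed random projection hash. It assumes each hash function projects the input onto a random unit vector and floors the result divided by the bucket length. The class `RandomProjectionHashSketch` and its helpers are hypothetical names for illustration and are not part of the Spark API.

```java
import java.util.Random;

// Hypothetical illustration class; not part of org.apache.spark.ml.
final class RandomProjectionHashSketch {

    // Draws a random direction with Gaussian components and scales it to unit length.
    static double[] randomUnitVector(int dim, Random rng) {
        double[] v = new double[dim];
        double sumSquares = 0.0;
        for (int i = 0; i < dim; i++) {
            v[i] = rng.nextGaussian();
            sumSquares += v[i] * v[i];
        }
        double norm = Math.sqrt(sumSquares);
        for (int i = 0; i < dim; i++) {
            v[i] /= norm;
        }
        return v;
    }

    // Assumed hash form: one value per projection, floor((x . v) / bucketLength).
    // The output dimension equals the number of projection vectors.
    static double[] hash(double[] x, double[][] projections, double bucketLength) {
        double[] out = new double[projections.length];
        for (int t = 0; t < projections.length; t++) {
            double dot = 0.0;
            for (int i = 0; i < x.length; i++) {
                dot += x[i] * projections[t][i];
            }
            out[t] = Math.floor(dot / bucketLength);
        }
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random(12345L);
        int dim = 3;
        int numTables = 4;
        double[][] projections = new double[numTables][];
        for (int t = 0; t < numTables; t++) {
            projections[t] = randomUnitVector(dim, rng);
        }
        double[] point = {1.0, 2.0, 3.0};
        double[] hashes = hash(point, projections, 2.0);
        for (double h : hashes) {
            System.out.println(h);
        }
    }
}
```

Because every point is hashed with the same projection vectors, values at the same output position are comparable across points, which is what "the same dimension is calculated by the same hash function" refers to.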
References:
1. Wikipedia on Stable Distributions
2. Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
See Also:
Serialized Form
Nested Class Summary
* ### Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
  `org.apache.spark.internal.Logging.SparkShellLoggingFilter`
Constructor Summary
Constructors
| Constructor and Description |
| --- |
| `BucketedRandomProjectionLSH()` |
| `BucketedRandomProjectionLSH(String uid)` |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `DoubleParam` | `bucketLength()` The length of each hash bucket, a larger bucket lowers the false negative rate. |
| `BucketedRandomProjectionLSH` | `copy(ParamMap extra)` Creates a copy of this instance with the same UID and some extra params. |
| `T` | `fit(Dataset<?> dataset)` Fits a model to the input data. |
| `Param<String>` | `inputCol()` Param for input column name. |
| `static BucketedRandomProjectionLSH` | `load(String path)` |
| `IntParam` | `numHashTables()` Param for the number of hash tables used in LSH OR-amplification. |
| `Param<String>` | `outputCol()` Param for output column name. |
| `static MLReader<T>` | `read()` |
| `LongParam` | `seed()` Param for random seed. |
| `BucketedRandomProjectionLSH` | `setBucketLength(double value)` |
| `BucketedRandomProjectionLSH` | `setInputCol(String value)` |
| `BucketedRandomProjectionLSH` | `setNumHashTables(int value)` |
| `BucketedRandomProjectionLSH` | `setOutputCol(String value)` |
| `BucketedRandomProjectionLSH` | `setSeed(long value)` |
| `StructType` | `transformSchema(StructType schema)` Check transform validity and derive the output schema from the input schema. |
| `String` | `uid()` An immutable unique ID for the object and its derivatives. |
* ### Methods inherited from class org.apache.spark.ml.[Estimator](../../../../../org/apache/spark/ml/Estimator.html "class in org.apache.spark.ml")
  `[fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-org.apache.spark.ml.param.ParamMap-), [fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-org.apache.spark.ml.param.ParamPair-org.apache.spark.ml.param.ParamPair...-), [fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-org.apache.spark.ml.param.ParamPair-scala.collection.Seq-), [fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-scala.collection.Seq-)`
* ### Methods inherited from class org.apache.spark.ml.[PipelineStage](../../../../../org/apache/spark/ml/PipelineStage.html "class in org.apache.spark.ml")
  `[params](../../../../../org/apache/spark/ml/PipelineStage.html#params--)`
* ### Methods inherited from class Object
  `equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`
* ### Methods inherited from interface org.apache.spark.ml.feature.[BucketedRandomProjectionLSHParams](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSHParams.html "interface in org.apache.spark.ml.feature")
  `[getBucketLength](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSHParams.html#getBucketLength--)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasSeed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html "interface in org.apache.spark.ml.param.shared")
  `[getSeed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html#getSeed--)`
* ### Methods inherited from interface org.apache.spark.ml.param.[Params](../../../../../org/apache/spark/ml/param/Params.html "interface in org.apache.spark.ml.param")
  `[clear](../../../../../org/apache/spark/ml/param/Params.html#clear-org.apache.spark.ml.param.Param-), [copyValues](../../../../../org/apache/spark/ml/param/Params.html#copyValues-T-org.apache.spark.ml.param.ParamMap-), [defaultCopy](../../../../../org/apache/spark/ml/param/Params.html#defaultCopy-org.apache.spark.ml.param.ParamMap-), [defaultParamMap](../../../../../org/apache/spark/ml/param/Params.html#defaultParamMap--), [explainParam](../../../../../org/apache/spark/ml/param/Params.html#explainParam-org.apache.spark.ml.param.Param-), [explainParams](../../../../../org/apache/spark/ml/param/Params.html#explainParams--), [extractParamMap](../../../../../org/apache/spark/ml/param/Params.html#extractParamMap--), [extractParamMap](../../../../../org/apache/spark/ml/param/Params.html#extractParamMap-org.apache.spark.ml.param.ParamMap-), [get](../../../../../org/apache/spark/ml/param/Params.html#get-org.apache.spark.ml.param.Param-), [getDefault](../../../../../org/apache/spark/ml/param/Params.html#getDefault-org.apache.spark.ml.param.Param-), [getOrDefault](../../../../../org/apache/spark/ml/param/Params.html#getOrDefault-org.apache.spark.ml.param.Param-), [getParam](../../../../../org/apache/spark/ml/param/Params.html#getParam-java.lang.String-), [hasDefault](../../../../../org/apache/spark/ml/param/Params.html#hasDefault-org.apache.spark.ml.param.Param-), [hasParam](../../../../../org/apache/spark/ml/param/Params.html#hasParam-java.lang.String-), [isDefined](../../../../../org/apache/spark/ml/param/Params.html#isDefined-org.apache.spark.ml.param.Param-), [isSet](../../../../../org/apache/spark/ml/param/Params.html#isSet-org.apache.spark.ml.param.Param-), [onParamChange](../../../../../org/apache/spark/ml/param/Params.html#onParamChange-org.apache.spark.ml.param.Param-), [paramMap](../../../../../org/apache/spark/ml/param/Params.html#paramMap--), [params](../../../../../org/apache/spark/ml/param/Params.html#params--), [set](../../../../../org/apache/spark/ml/param/Params.html#set-org.apache.spark.ml.param.Param-T-), [set](../../../../../org/apache/spark/ml/param/Params.html#set-org.apache.spark.ml.param.ParamPair-), [set](../../../../../org/apache/spark/ml/param/Params.html#set-java.lang.String-java.lang.Object-), [setDefault](../../../../../org/apache/spark/ml/param/Params.html#setDefault-org.apache.spark.ml.param.Param-T-), [setDefault](../../../../../org/apache/spark/ml/param/Params.html#setDefault-scala.collection.Seq-), [shouldOwn](../../../../../org/apache/spark/ml/param/Params.html#shouldOwn-org.apache.spark.ml.param.Param-)`
* ### Methods inherited from interface org.apache.spark.ml.util.[Identifiable](../../../../../org/apache/spark/ml/util/Identifiable.html "interface in org.apache.spark.ml.util")
  `[toString](../../../../../org/apache/spark/ml/util/Identifiable.html#toString--)`
* ### Methods inherited from interface org.apache.spark.ml.feature.[LSHParams](../../../../../org/apache/spark/ml/feature/LSHParams.html "interface in org.apache.spark.ml.feature")
  `[getNumHashTables](../../../../../org/apache/spark/ml/feature/LSHParams.html#getNumHashTables--), [validateAndTransformSchema](../../../../../org/apache/spark/ml/feature/LSHParams.html#validateAndTransformSchema-org.apache.spark.sql.types.StructType-)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasInputCol](../../../../../org/apache/spark/ml/param/shared/HasInputCol.html "interface in org.apache.spark.ml.param.shared")
  `[getInputCol](../../../../../org/apache/spark/ml/param/shared/HasInputCol.html#getInputCol--)`
* ### Methods inherited from interface org.apache.spark.ml.param.shared.[HasOutputCol](../../../../../org/apache/spark/ml/param/shared/HasOutputCol.html "interface in org.apache.spark.ml.param.shared")
  `[getOutputCol](../../../../../org/apache/spark/ml/param/shared/HasOutputCol.html#getOutputCol--)`
* ### Methods inherited from interface org.apache.spark.ml.util.[DefaultParamsWritable](../../../../../org/apache/spark/ml/util/DefaultParamsWritable.html "interface in org.apache.spark.ml.util")
  `[write](../../../../../org/apache/spark/ml/util/DefaultParamsWritable.html#write--)`
* ### Methods inherited from interface org.apache.spark.ml.util.[MLWritable](../../../../../org/apache/spark/ml/util/MLWritable.html "interface in org.apache.spark.ml.util")
  `[save](../../../../../org/apache/spark/ml/util/MLWritable.html#save-java.lang.String-)`
* ### Methods inherited from interface org.apache.spark.internal.Logging
  `$init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize`
Constructor Detail
* #### BucketedRandomProjectionLSH
  public BucketedRandomProjectionLSH(String uid)
* #### BucketedRandomProjectionLSH
  public BucketedRandomProjectionLSH()
Method Detail
* #### load
  public static [BucketedRandomProjectionLSH](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html "class in org.apache.spark.ml.feature") load(String path)

* #### read
  public static [MLReader](../../../../../org/apache/spark/ml/util/MLReader.html "class in org.apache.spark.ml.util")<T> read()

* #### seed
  public final [LongParam](../../../../../org/apache/spark/ml/param/LongParam.html "class in org.apache.spark.ml.param") seed()
  Description copied from interface: `[HasSeed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html#seed--)`
  Param for random seed.
  Specified by: `[seed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html#seed--)` in interface `[HasSeed](../../../../../org/apache/spark/ml/param/shared/HasSeed.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)

* #### bucketLength
  public [DoubleParam](../../../../../org/apache/spark/ml/param/DoubleParam.html "class in org.apache.spark.ml.param") bucketLength()
  The length of each hash bucket; a larger bucket lowers the false negative rate. The number of buckets will be `(max L2 norm of input vectors) / bucketLength`.
  If the input vectors are normalized, 1 to 10 times `pow(numRecords, -1/inputDim)` is a reasonable value.
  Specified by: `[bucketLength](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSHParams.html#bucketLength--)` in interface `[BucketedRandomProjectionLSHParams](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSHParams.html "interface in org.apache.spark.ml.feature")`
  Returns: (undocumented)

* #### uid
  public String uid()
  An immutable unique ID for the object and its derivatives.
  Specified by: `[uid](../../../../../org/apache/spark/ml/util/Identifiable.html#uid--)` in interface `[Identifiable](../../../../../org/apache/spark/ml/util/Identifiable.html "interface in org.apache.spark.ml.util")`
  Returns: (undocumented)

* #### setInputCol
  public [BucketedRandomProjectionLSH](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html "class in org.apache.spark.ml.feature") setInputCol(String value)

* #### setOutputCol
  public [BucketedRandomProjectionLSH](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html "class in org.apache.spark.ml.feature") setOutputCol(String value)

* #### setNumHashTables
  public [BucketedRandomProjectionLSH](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html "class in org.apache.spark.ml.feature") setNumHashTables(int value)

* #### setBucketLength
  public [BucketedRandomProjectionLSH](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html "class in org.apache.spark.ml.feature") setBucketLength(double value)

* #### setSeed
  public [BucketedRandomProjectionLSH](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html "class in org.apache.spark.ml.feature") setSeed(long value)

* #### transformSchema
  public [StructType](../../../../../org/apache/spark/sql/types/StructType.html "class in org.apache.spark.sql.types") transformSchema([StructType](../../../../../org/apache/spark/sql/types/StructType.html "class in org.apache.spark.sql.types") schema)
  Check transform validity and derive the output schema from the input schema.
  Validity of interactions between parameters is checked during `transformSchema`, and an exception is raised if any parameter value is invalid. Parameter value checks that do not depend on other parameters are handled by `Param.validate()`.
  A typical implementation should first verify the schema change and parameter validity, including complex parameter interaction checks.
  Specified by: `[transformSchema](../../../../../org/apache/spark/ml/PipelineStage.html#transformSchema-org.apache.spark.sql.types.StructType-)` in class `[PipelineStage](../../../../../org/apache/spark/ml/PipelineStage.html "class in org.apache.spark.ml")`
  Parameters: `schema` \- (undocumented)
  Returns: (undocumented)

* #### copy
  public [BucketedRandomProjectionLSH](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html "class in org.apache.spark.ml.feature") copy([ParamMap](../../../../../org/apache/spark/ml/param/ParamMap.html "class in org.apache.spark.ml.param") extra)
  Description copied from interface: `[Params](../../../../../org/apache/spark/ml/param/Params.html#copy-org.apache.spark.ml.param.ParamMap-)`
  Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See `defaultCopy()`.
  Specified by: `[copy](../../../../../org/apache/spark/ml/param/Params.html#copy-org.apache.spark.ml.param.ParamMap-)` in interface `[Params](../../../../../org/apache/spark/ml/param/Params.html "interface in org.apache.spark.ml.param")`
  Specified by: `[copy](../../../../../org/apache/spark/ml/Estimator.html#copy-org.apache.spark.ml.param.ParamMap-)` in class `[Estimator](../../../../../org/apache/spark/ml/Estimator.html "class in org.apache.spark.ml")<[BucketedRandomProjectionLSHModel](../../../../../org/apache/spark/ml/feature/BucketedRandomProjectionLSHModel.html "class in org.apache.spark.ml.feature")>`
  Parameters: `extra` \- (undocumented)
  Returns: (undocumented)

* #### fit
  public T fit([Dataset](../../../../../org/apache/spark/sql/Dataset.html "class in org.apache.spark.sql")<?> dataset)
  Description copied from class: `[Estimator](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-)`
  Fits a model to the input data.
  Specified by: `[fit](../../../../../org/apache/spark/ml/Estimator.html#fit-org.apache.spark.sql.Dataset-)` in class `[Estimator](../../../../../org/apache/spark/ml/Estimator.html "class in org.apache.spark.ml")<T extends org.apache.spark.ml.feature.LSHModel<T>>`
  Parameters: `dataset` \- (undocumented)
  Returns: (undocumented)

* #### inputCol
  public final [Param](../../../../../org/apache/spark/ml/param/Param.html "class in org.apache.spark.ml.param")<String> inputCol()
  Description copied from interface: `[HasInputCol](../../../../../org/apache/spark/ml/param/shared/HasInputCol.html#inputCol--)`
  Param for input column name.
  Specified by: `[inputCol](../../../../../org/apache/spark/ml/param/shared/HasInputCol.html#inputCol--)` in interface `[HasInputCol](../../../../../org/apache/spark/ml/param/shared/HasInputCol.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)

* #### numHashTables
  public final [IntParam](../../../../../org/apache/spark/ml/param/IntParam.html "class in org.apache.spark.ml.param") numHashTables()
  Description copied from interface: `[LSHParams](../../../../../org/apache/spark/ml/feature/LSHParams.html#numHashTables--)`
  Param for the number of hash tables used in LSH OR-amplification.
  LSH OR-amplification can be used to reduce the false negative rate. Higher values for this param lead to a reduced false negative rate, at the expense of added computational complexity.
  Specified by: `[numHashTables](../../../../../org/apache/spark/ml/feature/LSHParams.html#numHashTables--)` in interface `[LSHParams](../../../../../org/apache/spark/ml/feature/LSHParams.html "interface in org.apache.spark.ml.feature")`
  Returns: (undocumented)

* #### outputCol
  public final [Param](../../../../../org/apache/spark/ml/param/Param.html "class in org.apache.spark.ml.param")<String> outputCol()
  Param for output column name.
  Specified by: `[outputCol](../../../../../org/apache/spark/ml/param/shared/HasOutputCol.html#outputCol--)` in interface `[HasOutputCol](../../../../../org/apache/spark/ml/param/shared/HasOutputCol.html "interface in org.apache.spark.ml.param.shared")`
  Returns: (undocumented)
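
The guidance given for `bucketLength` and `numHashTables` can be made concrete with a small plain-Java sketch. It computes the suggested `bucketLength` range for normalized input, 1 to 10 times `pow(numRecords, -1/inputDim)`. It also computes the false negative rate under OR-amplification, assuming the hash tables collide independently and each one collides with probability `p` for a given pair. The class `LshTuningSketch`, its method names, and the independence assumption are illustrative and are not part of the Spark API.

```java
// Hypothetical helper for reasoning about parameter choices; not part of Spark.
final class LshTuningSketch {

    // Returns {low, high} = {1x, 10x} of pow(numRecords, -1/inputDim),
    // the range suggested for normalized input vectors.
    static double[] suggestedBucketLengthRange(long numRecords, int inputDim) {
        double base = Math.pow((double) numRecords, -1.0 / inputDim);
        return new double[] {base, 10.0 * base};
    }

    // Assumption: tables are independent, so a near pair is missed only if
    // every one of the numHashTables tables fails to collide: (1 - p)^L.
    static double falseNegativeRate(double singleTableCollisionProb, int numHashTables) {
        return Math.pow(1.0 - singleTableCollisionProb, numHashTables);
    }

    public static void main(String[] args) {
        double[] range = suggestedBucketLengthRange(10_000L, 4);
        System.out.println("bucketLength between " + range[0] + " and " + range[1]);
        for (int tables = 1; tables <= 5; tables++) {
            System.out.println(tables + " table(s): false negative rate "
                    + falseNegativeRate(0.5, tables));
        }
    }
}
```

Under this assumption the false negative rate shrinks geometrically with each added table, while the hashing and candidate-matching work grows with the number of tables, which matches the trade-off described for `numHashTables`.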