RegexTokenizer (Spark 3.5.5 JavaDoc)
Object
- org.apache.spark.ml.PipelineStage
  - org.apache.spark.ml.Transformer
    - org.apache.spark.ml.UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
      - org.apache.spark.ml.feature.RegexTokenizer
All Implemented Interfaces:
java.io.Serializable, org.apache.spark.internal.Logging, Params, HasInputCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable
public class RegexTokenizer
extends UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
implements DefaultParamsWritable
A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (the default) or by repeatedly matching the regex (if `gaps` is false). Optional parameters also allow filtering tokens by a minimum length. It returns an array of strings, which can be empty.
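The two modes described above can be illustrated without Spark by using `java.util.regex` directly. The following is a minimal sketch, assuming the tokenizer lowercases the input first, then splits or matches, then filters by length. The class `RegexTokenizerSketch` and its `tokenize` method are hypothetical names for illustration and are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mimics the documented behaviour; this is not Spark code.
public class RegexTokenizerSketch {

    // gaps = true: the pattern matches delimiters, so the text is split on it.
    // gaps = false: the pattern matches the tokens themselves.
    public static List<String> tokenize(String text, String pattern, boolean gaps,
                                        int minTokenLength, boolean toLowercase) {
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        Pattern re = Pattern.compile(pattern);
        List<String> candidates = new ArrayList<>();
        if (gaps) {
            for (String piece : re.split(input)) {
                candidates.add(piece);
            }
        } else {
            Matcher m = re.matcher(input);
            while (m.find()) {
                candidates.add(m.group());
            }
        }
        // Drop candidates shorter than the minimum token length.
        List<String> tokens = new ArrayList<>();
        for (String c : candidates) {
            if (c.length() >= minTokenLength) {
                tokens.add(c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Spark ML, RegexTokenizer!";
        // Split mode with the documented default pattern: punctuation stays attached.
        System.out.println(tokenize(text, "\\s+", true, 1, true));
        // prints [spark, ml,, regextokenizer!]
        // Match mode: keep only runs of word characters.
        System.out.println(tokenize(text, "\\w+", false, 1, true));
        // prints [spark, ml, regextokenizer]
    }
}
```

In split mode the pattern describes what separates tokens, while in match mode it describes the tokens themselves, so the same input can yield different results depending on `gaps`.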
See Also:
Serialized Form
Nested Class Summary
### Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
`org.apache.spark.internal.Logging.SparkShellLoggingFilter`
Constructor Summary
Constructors
* `RegexTokenizer()`
* `RegexTokenizer(String uid)`

Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| `RegexTokenizer` | `copy(ParamMap extra)` | Creates a copy of this instance with the same UID and some extra params. |
| `BooleanParam` | `gaps()` | Indicates whether regex splits on gaps (true) or matches tokens (false). |
| `boolean` | `getGaps()` | |
| `int` | `getMinTokenLength()` | |
| `String` | `getPattern()` | |
| `boolean` | `getToLowercase()` | |
| `static RegexTokenizer` | `load(String path)` | |
| `IntParam` | `minTokenLength()` | Minimum token length, greater than or equal to 0. |
| `Param<String>` | `pattern()` | Regex pattern used to match delimiters if `gaps` is true or tokens if `gaps` is false. |
| `static MLReader<T>` | `read()` | |
| `RegexTokenizer` | `setGaps(boolean value)` | |
| `RegexTokenizer` | `setMinTokenLength(int value)` | |
| `RegexTokenizer` | `setPattern(String value)` | |
| `RegexTokenizer` | `setToLowercase(boolean value)` | |
| `BooleanParam` | `toLowercase()` | Indicates whether to convert all characters to lowercase before tokenizing. |
| `String` | `toString()` | |
| `String` | `uid()` | An immutable unique ID for the object and its derivatives. |

### Methods inherited from class org.apache.spark.ml.UnaryTransformer
`inputCol, outputCol, setInputCol, setOutputCol, transform, transformSchema`

### Methods inherited from class org.apache.spark.ml.Transformer
`transform, transform, transform`

### Methods inherited from class org.apache.spark.ml.PipelineStage
`params`

### Methods inherited from class Object
`equals, getClass, hashCode, notify, notifyAll, wait, wait, wait`

### Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
`write`

### Methods inherited from interface org.apache.spark.ml.util.MLWritable
`save`

### Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol
`getInputCol`

### Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol
`getOutputCol`

### Methods inherited from interface org.apache.spark.ml.param.Params
`clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn`

### Methods inherited from interface org.apache.spark.internal.Logging
`$init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize`
Constructor Detail
#### RegexTokenizer
`public RegexTokenizer(String uid)`

#### RegexTokenizer
`public RegexTokenizer()`
Method Detail
#### load
`public static RegexTokenizer load(String path)`

#### read
`public static MLReader<T> read()`

#### uid
`public String uid()`

An immutable unique ID for the object and its derivatives.

* Specified by: `uid` in interface `Identifiable`
* Returns: (undocumented)

#### minTokenLength
`public IntParam minTokenLength()`

Minimum token length, greater than or equal to 0. Default: 1, to avoid returning empty strings.

* Returns: (undocumented)

#### setMinTokenLength
`public RegexTokenizer setMinTokenLength(int value)`

#### getMinTokenLength
`public int getMinTokenLength()`

#### gaps
`public BooleanParam gaps()`

Indicates whether regex splits on gaps (true) or matches tokens (false). Default: true.

* Returns: (undocumented)

#### setGaps
`public RegexTokenizer setGaps(boolean value)`

#### getGaps
`public boolean getGaps()`

#### pattern
`public Param<String> pattern()`

Regex pattern used to match delimiters if `gaps` is true or tokens if `gaps` is false. Default: `"\\s+"`

* Returns: (undocumented)

#### setPattern
`public RegexTokenizer setPattern(String value)`

#### getPattern
`public String getPattern()`

#### toLowercase
`public final BooleanParam toLowercase()`

Indicates whether to convert all characters to lowercase before tokenizing. Default: true.

* Returns: (undocumented)

#### setToLowercase
`public RegexTokenizer setToLowercase(boolean value)`

#### getToLowercase
`public boolean getToLowercase()`

#### copy
`public RegexTokenizer copy(ParamMap extra)`

Description copied from interface: `Params`

Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See `defaultCopy()`.

* Specified by: `copy` in interface `Params`
* Overrides: `copy` in class `UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>`
* Parameters: `extra` - (undocumented)
* Returns: (undocumented)

#### toString
`public String toString()`

* Specified by: `toString` in interface `Identifiable`
* Overrides: `toString` in class `Object`
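The `minTokenLength` default of 1 is documented as avoiding empty strings. One way an empty string can arise is when a delimiter matches at the very start of the input: `java.util.regex.Pattern.split` then returns a leading empty string. The sketch below shows this in plain Java. It assumes the tokenizer's split step behaves like `Pattern.split`, and `MinTokenLengthSketch` and `splitAndFilter` are hypothetical names, not Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration of length filtering after a regex split; not Spark code.
public class MinTokenLengthSketch {

    // Split the text on the delimiter pattern, then keep pieces whose
    // length is at least minTokenLength.
    public static List<String> splitAndFilter(String text, String pattern, int minTokenLength) {
        return Arrays.stream(Pattern.compile(pattern).split(text))
                .filter(piece -> piece.length() >= minTokenLength)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "  a bc"; // leading whitespace matches the delimiter
        System.out.println(splitAndFilter(text, "\\s+", 0)); // prints [, a, bc]
        System.out.println(splitAndFilter(text, "\\s+", 1)); // prints [a, bc]
        System.out.println(splitAndFilter(text, "\\s+", 2)); // prints [bc]
    }
}
```

With a minimum length of 0 the leading empty piece survives, a minimum of 1 removes it, and larger values also drop short tokens.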