tf.raw_ops.UnicodeDecode | TensorFlow v2.16.1
tf.raw_ops.UnicodeDecode
Decodes each string in input into a sequence of Unicode code points.
Compat aliases for migration
See the Migration guide for more details.
tf.compat.v1.raw_ops.UnicodeDecode
tf.raw_ops.UnicodeDecode(
input,
input_encoding,
errors='replace',
replacement_char=65533,
replace_control_characters=False,
Tsplits=tf.dtypes.int64,
name=None
)
The character codepoints for all strings are returned using a single vector char_values, with strings expanded to characters in row-major order.
The row_splits tensor indicates where the codepoints for each input string begin and end within the char_values tensor. In particular, the values for the ith string (in row-major order) are stored in the slice [row_splits[i]:row_splits[i+1]]. Thus:
- char_values[row_splits[i]+j] is the Unicode codepoint for the jth character in the ith string (in row-major order).
- row_splits[i+1] - row_splits[i] is the number of characters in the ith string (in row-major order).
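
The flat layout described above can be sketched in plain Python. This is only an illustration of the indexing rules and does not call TensorFlow; the helper name flatten_codepoints is hypothetical, and it takes already-decoded Python strings rather than encoded bytes.

```python
def flatten_codepoints(strings):
    """Return (row_splits, char_values) for a list of Python strings.

    Hypothetical helper: it mimics the output layout of the op,
    not its decoding logic.
    """
    row_splits = [0]
    char_values = []
    for s in strings:
        char_values.extend(ord(c) for c in s)
        # Each entry marks where the next string starts in char_values.
        row_splits.append(len(char_values))
    return row_splits, char_values


strings = ["hi", "", "h\u00e9llo"]
row_splits, char_values = flatten_codepoints(strings)

i, j = 2, 1
# Codepoint of the j-th character in the i-th string.
assert char_values[row_splits[i] + j] == 0xE9
# Number of characters in the i-th string.
assert row_splits[i + 1] - row_splits[i] == 5

# Slice the flat vector back into one list of codepoints per input string.
rows = [char_values[row_splits[k]:row_splits[k + 1]] for k in range(len(strings))]
```

Note that an empty string contributes no codepoints, so its two neighbouring row_splits entries are equal.
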
Args | |
---|---|
input | A Tensor of type string. The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values. |
input_encoding | A string. Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: "UTF-16", "US ASCII", "UTF-8". |
errors | An optional string from: "strict", "replace", "ignore". Defaults to "replace". Error handling policy when there is invalid formatting found in the input. The value of 'strict' will cause the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) will cause the operation to replace any invalid formatting in the input with the replacement_char codepoint. A value of 'ignore' will cause the operation to skip any invalid formatting in the input and produce no corresponding output character. |
replacement_char | An optional int. Defaults to 65533. The replacement character codepoint to be used in place of any invalid formatting in the input when errors='replace'. Any valid unicode codepoint may be used. The default value is the Unicode replacement character, U+FFFD (decimal 65533). |
replace_control_characters | An optional bool. Defaults to False. Whether to replace the C0 control characters (00-1F) with the replacement_char. |
Tsplits | An optional tf.DType from: tf.int32, tf.int64. Defaults to tf.int64. |
name | A name for the operation (optional). |
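
The three error policies and the two replacement options can be approximated with Python's built-in codecs. This is a rough stand-in under stated assumptions, not the op's implementation: decode_codepoints is a hypothetical helper, 'strict' raises UnicodeDecodeError here rather than InvalidArgument, and ICU may group malformed bytes differently from Python, so the number of replacement characters produced for a given bad sequence can differ.

```python
import codecs


def decode_codepoints(data, encoding="utf-8", errors="replace",
                      replacement_char=0xFFFD,
                      replace_control_characters=False):
    """Decode bytes to a list of codepoints (hypothetical helper)."""
    if errors == "replace":
        # Custom handler: emit replacement_char and resume after the bad bytes.
        def handler(exc):
            return (chr(replacement_char), exc.end)

        handler_name = f"replace_with_{replacement_char}"
        codecs.register_error(handler_name, handler)
        text = data.decode(encoding, errors=handler_name)
    else:
        # "strict" raises on invalid input; "ignore" silently drops it.
        text = data.decode(encoding, errors=errors)

    codepoints = [ord(c) for c in text]
    if replace_control_characters:
        # C0 control characters occupy the range 0x00 to 0x1F.
        codepoints = [replacement_char if cp <= 0x1F else cp
                      for cp in codepoints]
    return codepoints


bad = b"a\xffb"  # 0xFF is never valid in UTF-8
replaced = decode_codepoints(bad)
ignored = decode_codepoints(bad, errors="ignore")
custom = decode_codepoints(bad, replacement_char=ord("?"))
```

With the default policy the invalid byte becomes 65533, with 'ignore' it disappears from the output, and with a custom replacement_char it becomes that codepoint instead.
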
Returns | |
---|---|
A tuple of Tensor objects (row_splits, char_values). | |
row_splits | A Tensor of type Tsplits. |
char_values | A Tensor of type int32. |