tf.lookup.StaticVocabularyTable  |  TensorFlow v2.16.1

tf.lookup.StaticVocabularyTable

String to Id table that assigns out-of-vocabulary keys to hash buckets.

Inherits From: TrackableResource

tf.lookup.StaticVocabularyTable(
    initializer,
    num_oov_buckets,
    lookup_key_dtype=None,
    name=None,
    experimental_is_anonymous=False
)

For example, if an instance of StaticVocabularyTable is initialized with a string-to-id initializer that maps:

init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(['emerson', 'lake', 'palmer']),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(
    init,
    num_oov_buckets=5)

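The table performs the following mapping: each in-vocabulary key maps to its assigned id (emerson -> 0, lake -> 1, palmer -> 2), and any other key maps to an out-of-vocabulary bucket id between 3 and 7, that is, from the vocabulary size up to the vocabulary size + num_oov_buckets - 1.

For example, if input_tensor is: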

input_tensor = tf.constant(["emerson", "lake", "palmer", "king", "crimson"])
table[input_tensor].numpy()
# array([0, 1, 2, 6, 7])

If initializer is None, only out-of-vocabulary buckets are used.
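A minimal sketch of the initializer-free case, assuming eager execution. The keys are arbitrary, and lookup_key_dtype is passed explicitly as a precaution rather than relying on the documented default. Because there is no vocabulary, every key is hashed into one of the num_oov_buckets buckets, so the ids fall in the range 0 to num_oov_buckets - 1.

```python
import tensorflow as tf

# No vocabulary: every key is hashed into one of the 4 buckets.
# lookup_key_dtype is set explicitly here instead of relying on the default.
hash_only_table = tf.lookup.StaticVocabularyTable(
    initializer=None,
    num_oov_buckets=4,
    lookup_key_dtype=tf.string)

ids = hash_only_table.lookup(tf.constant(["emerson", "lake", "palmer"]))
print(ids.numpy())  # three ids, each between 0 and 3
```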

Example usage:

import tempfile

num_oov_buckets = 3
vocab = ["emerson", "lake", "palmer", "crimnson"]
f = tempfile.NamedTemporaryFile(delete=False)
f.write('\n'.join(vocab).encode('utf-8'))
f.close()

init = tf.lookup.TextFileInitializer(
    f.name,
    key_dtype=tf.string,
    key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
    value_dtype=tf.int64,
    value_index=tf.lookup.TextFileIndex.LINE_NUMBER)
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
table.lookup(tf.constant(["palmer", "crimnson", "king", "tarkus", "black", "moon"])).numpy()
# array([2, 3, 5, 6, 6, 4])

The hash function used to generate out-of-vocabulary bucket IDs is Fingerprint64.

Note that the out-of-vocabulary bucket IDs always range from the table size up to size + num_oov_buckets - 1 regardless of the table values, which could cause unexpected collisions:

init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake", "palmer"]),
    values=tf.constant([1, 2, 3], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(
    init,
    num_oov_buckets=1)
input_tensor = tf.constant(["emerson", "lake", "palmer", "king"])
table[input_tensor].numpy()
# array([1, 2, 3, 3])
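The bucket assignment described above can be reproduced by hand. The sketch below assumes that tf.strings.to_hash_bucket_fast applies the same Fingerprint64 hash the table uses; under that assumption, an out-of-vocabulary id is the hash bucket shifted by the vocabulary size. The variable names are illustrative.

```python
import tensorflow as tf

init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake", "palmer"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
num_oov_buckets = 5
vocab_size = 3
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=num_oov_buckets)

oov_keys = tf.constant(["king", "crimson"])

# Hash each key into [0, num_oov_buckets), then shift past the vocabulary.
manual_ids = tf.strings.to_hash_bucket_fast(oov_keys, num_oov_buckets) + vocab_size
table_ids = table.lookup(oov_keys)

print(manual_ids.numpy(), table_ids.numpy())
```

Because the shift depends only on the number of entries, keeping the vocabulary values dense in 0 to size - 1 avoids the collision shown in the previous example.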

Args
initializer A TableInitializerBase object that contains the data used to initialize the table. If None, then we only use out-of-vocab buckets.
num_oov_buckets Number of buckets to use for out-of-vocabulary keys. Must be greater than zero. If out-of-vocab buckets are not required, use StaticHashTable instead.
lookup_key_dtype Data type of keys passed to lookup. Defaults to initializer.key_dtype if initializer is specified, otherwise tf.string. Must be string or integer, and must be castable to initializer.key_dtype.
name A name for the operation (optional).
experimental_is_anonymous Whether to use anonymous mode for the table (default is False). In anonymous mode, the table resource can only be accessed via a resource handle. It can't be looked up by a name. When all resource handles pointing to that resource are gone, the resource will be deleted automatically.
Raises
ValueError when num_oov_buckets is not positive.
TypeError when lookup_key_dtype or initializer.key_dtype are not integer or string. Also when initializer.value_dtype != int64.
Attributes
key_dtype The table key dtype.
name The name of the table.
resource_handle Returns the resource handle associated with this Resource.
value_dtype The table value dtype.
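
The constraints listed under Args and Raises, and the dtype attributes, can be checked directly. This is a small sketch assuming eager execution; make_init is a hypothetical helper introduced only for this example.

```python
import tensorflow as tf

def make_init(value_dtype=tf.int64):
    """Builds a small string-to-id initializer (illustrative helper)."""
    return tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(["emerson", "lake", "palmer"]),
        values=tf.constant([0, 1, 2], dtype=value_dtype))

# num_oov_buckets must be greater than zero.
try:
    tf.lookup.StaticVocabularyTable(make_init(), num_oov_buckets=0)
    zero_buckets_error = None
except ValueError as e:
    zero_buckets_error = e

# The initializer's values must be int64.
try:
    tf.lookup.StaticVocabularyTable(make_init(tf.int32), num_oov_buckets=1)
    value_dtype_error = None
except TypeError as e:
    value_dtype_error = e

# A valid table exposes its key and value dtypes as attributes.
table = tf.lookup.StaticVocabularyTable(make_init(), num_oov_buckets=1)
print(table.key_dtype, table.value_dtype)
```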

Methods

lookup

lookup(
    keys, name=None
)

Looks up keys in the table, outputs the corresponding values.

It assigns out-of-vocabulary keys to buckets based on their hashes.

Args
keys Keys to look up. May be a SparseTensor, a RaggedTensor, or a dense Tensor.
name Optional name for the op.
Returns
A SparseTensor if keys are sparse, a RaggedTensor if keys are ragged, otherwise a dense Tensor.
Raises
TypeError when keys doesn't match the table key data type.
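
The return types and the dtype check described above can be sketched as follows, assuming eager execution. The table and keys are illustrative; only in-vocabulary keys are used so the resulting ids are predictable.

```python
import tensorflow as tf

init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake", "palmer"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=5)

# Ragged keys produce a RaggedTensor with the same row structure.
ragged_keys = tf.ragged.constant([["emerson", "lake"], ["palmer"]])
ragged_ids = table.lookup(ragged_keys)

# Sparse keys produce a SparseTensor with the same indices and shape.
sparse_keys = tf.SparseTensor(
    indices=[[0, 0], [1, 1]],
    values=tf.constant(["lake", "palmer"]),
    dense_shape=[2, 2])
sparse_ids = table.lookup(sparse_keys)

# Integer keys do not match the table's string key dtype.
try:
    table.lookup(tf.constant([1, 2], dtype=tf.int64))
    dtype_error = None
except TypeError as e:
    dtype_error = e

print(ragged_ids.to_list(), sparse_ids.values.numpy())
```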

size

size(
    name=None
)

Compute the number of elements in this table.
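
A short sketch of calling size(), assuming eager execution. As far as can be told from the TensorFlow source, the reported count includes the out-of-vocabulary buckets in addition to the vocabulary entries; treat that as an assumption to verify against your installed version.

```python
import tensorflow as tf

init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake", "palmer"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=5)

# size() returns a scalar int64 tensor.
total = table.size()
print(int(total))  # expected: 3 vocabulary entries + 5 OOV buckets
```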

__getitem__

__getitem__(
    keys
)

Looks up keys in a table, outputs the corresponding values.
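
A brief sketch showing that indexing the table gives the same result as calling lookup, assuming eager execution; the keys are illustrative.

```python
import tensorflow as tf

init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["emerson", "lake", "palmer"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=5)

keys = tf.constant(["lake", "king"])

# Both forms return the same ids for the same keys.
via_index = table[keys].numpy()
via_lookup = table.lookup(keys).numpy()
print(via_index, via_lookup)
```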