tf.keras.preprocessing.sequence.make_sampling_table | TensorFlow v2.16.1

tf.keras.preprocessing.sequence.make_sampling_table

Generates a word rank-based probabilistic sampling table.

tf.keras.preprocessing.sequence.make_sampling_table(
    size, sampling_factor=1e-05
)

Used in the tutorials: word2vec

DEPRECATED.

Used for generating the sampling_table argument for skipgrams. sampling_table[i] is the probability of sampling the i-th most common word in a dataset (more common words should be sampled less frequently, for balance).
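As a rough illustration of how such a table could be consumed, the sketch below keeps each word with probability sampling_table[i] and drops it otherwise. The helper name subsample and the toy table values are hypothetical and not part of the Keras API; the sketch assumes words are encoded as integer ranks that index the table directly.

```python
import random

def subsample(sequence, sampling_table, seed=None):
    """Keep each word index w with probability sampling_table[w] (hypothetical helper)."""
    rng = random.Random(seed)
    return [w for w in sequence if rng.random() < sampling_table[w]]

# Toy table with made-up values: index 0 is the most common word and is rarely kept,
# index 3 is rare and is always kept.
toy_table = [0.05, 0.3, 0.8, 1.0]
sequence = [0, 0, 1, 3, 0, 2, 3, 0, 1, 3]

kept = subsample(sequence, toy_table, seed=0)
print(kept)
```

Words with a table entry of 1.0 always survive, so rare words are never dropped, while very frequent words are thinned out.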

The sampling probabilities are generated according to the sampling distribution used in word2vec:

p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
    (word_frequency / sampling_factor)))
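The formula above can be written out for a single word as a minimal sketch. The function name sampling_probability is an assumption made for illustration, and word_frequency is taken to be the word's relative frequency in the corpus (a value between 0 and 1).

```python
import math

def sampling_probability(word_frequency, sampling_factor=1e-5):
    """Evaluate p(word) from the formula above for one word (illustrative only)."""
    ratio = word_frequency / sampling_factor
    return min(1.0, math.sqrt(ratio) / ratio)

# A word making up 1% of the corpus is sampled rarely ...
common = sampling_probability(1e-2)
# ... while a word rarer than the sampling factor is always sampled.
rare = sampling_probability(1e-6)
print(common, rare)
```

Since sqrt(x) / x equals 1 / sqrt(x), the probability falls off with the square root of the frequency and is capped at 1 once the frequency drops to the sampling factor or below.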

We assume that the word frequencies follow Zipf's law (s=1) to derive a numerical approximation of frequency(rank):

frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))

where gamma is the Euler-Mascheroni constant.
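Combining the Zipf approximation with the word2vec formula gives a whole table. The following is a pure-Python sketch under stated assumptions, not the library's implementation: the name sampling_table_sketch is hypothetical, and entry i is assumed to correspond to rank i + 1 so that log(rank) is defined for the first entry. The values it produces may therefore differ slightly from what make_sampling_table returns.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def sampling_table_sketch(size, sampling_factor=1e-5):
    """Build a rank-based sampling table from the two formulas above (sketch)."""
    table = []
    for i in range(size):
        rank = i + 1  # assumption: entry i holds the word of rank i + 1
        # Inverse of the approximated frequency(rank).
        inv_frequency = rank * (math.log(rank) + EULER_GAMMA) + 0.5 - 1.0 / (12.0 * rank)
        ratio = (1.0 / inv_frequency) / sampling_factor  # word_frequency / sampling_factor
        table.append(min(1.0, math.sqrt(ratio) / ratio))
    return table

table = sampling_table_sketch(100, sampling_factor=1e-2)
print(table[0], table[-1])
```

The approximated frequency decreases with rank, so the table is non-decreasing: the most common words receive the smallest sampling probabilities.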

Args
size	Int, number of possible words to sample.
sampling_factor	The sampling factor in the word2vec formula.

Returns
A 1D NumPy array of length size where the i-th entry is the probability that a word of rank i should be sampled.
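A short usage sketch, assuming a TensorFlow installation in which this deprecated function is still available under the path documented here. The vocabulary size of 10 and the alternative sampling factor are arbitrary values chosen for illustration.

```python
import tensorflow as tf

# Table for a vocabulary of 10 word ranks, using the default sampling_factor.
table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(table.shape)

# A larger sampling_factor raises the sampling probabilities.
looser = tf.keras.preprocessing.sequence.make_sampling_table(10, sampling_factor=1e-3)
print(table[:3])
print(looser[:3])
```

The resulting array can then be passed as the sampling_table argument of skipgrams.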


Last updated 2024-06-07 UTC.