Generates a word rank-based probabilistic sampling table.

tf.keras.preprocessing.sequence.make_sampling_table( size, sampling_factor=1e-05 )

Used for generating the `sampling_table`

argument for `skipgrams`

. `sampling_table[i]`

is the probability of sampling the word i-th most common word in a dataset (more common words should be sampled less frequently, for balance).

The sampling probabilities are generated according to the sampling distribution used in word2vec:

p(word) = (min(1, sqrt(word_frequency / sampling_factor) / (word_frequency / sampling_factor)))

We assume that the word frequencies follow Zipf's law (s=1) to derive a numerical approximation of frequency(rank):

`frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`

where `gamma`

is the Euler-Mascheroni constant.

size: Int, number of possible words to sample. sampling_factor: The sampling factor in the word2vec formula.

A 1D Numpy array of length `size` where the ith entry is the probability that a word of rank i should be sampled.

© 2020 The TensorFlow Authors. All rights reserved.

Licensed under the Creative Commons Attribution License 3.0.

Code samples licensed under the Apache 2.0 License.

https://www.tensorflow.org/versions/r2.3/api_docs/python/tf/keras/preprocessing/sequence/make_sampling_table