View source on GitHub |
Text tokenization utility class.
tf.keras.preprocessing.text.Tokenizer( num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0, **kwargs )
This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf...
Arguments | |
---|---|
num_words | the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept. |
filters | a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character. |
lower | boolean. Whether to convert the texts to lowercase. |
split | str. Separator for word splitting. |
char_level | if True, every character will be treated as a token. |
oov_token | if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls |
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words maybe include the '
character). These sequences are then split into lists of tokens. They will then be indexed or vectorized.
0
is a reserved index that won't be assigned to any word.
fit_on_sequences
fit_on_sequences( sequences )
Updates internal vocabulary based on a list of sequences.
Required before using sequences_to_matrix
(if fit_on_texts
was never called).
Arguments | |
---|---|
sequences | A list of sequence. A "sequence" is a list of integer word indices. |
fit_on_texts
fit_on_texts( texts )
Updates internal vocabulary based on a list of texts.
In the case where texts contains lists, we assume each entry of the lists to be a token.
Required before using texts_to_sequences
or texts_to_matrix
.
Arguments | |
---|---|
texts | can be a list of strings, a generator of strings (for memory-efficiency), or a list of list of strings. |
get_config
get_config()
Returns the tokenizer configuration as Python dictionary. The word count dictionaries used by the tokenizer get serialized into plain JSON, so that the configuration can be read by other projects.
Returns | |
---|---|
A Python dictionary with the tokenizer configuration. |
sequences_to_matrix
sequences_to_matrix( sequences, mode='binary' )
Converts a list of sequences into a Numpy matrix.
Arguments | |
---|---|
sequences | list of sequences (a sequence is a list of integer word indices). |
mode | one of "binary", "count", "tfidf", "freq" |
Returns | |
---|---|
A Numpy matrix. |
Raises | |
---|---|
ValueError | In case of invalid mode argument, or if the Tokenizer requires to be fit to sample data. |
sequences_to_texts
sequences_to_texts( sequences )
Transforms each sequence into a list of text.
Only top num_words-1
most frequent words will be taken into account. Only words known by the tokenizer will be taken into account.
Arguments | |
---|---|
sequences | A list of sequences (list of integers). |
Returns | |
---|---|
A list of texts (strings) |
sequences_to_texts_generator
sequences_to_texts_generator( sequences )
Transforms each sequence in sequences
to a list of texts(strings).
Each sequence has to a list of integers. In other words, sequences should be a list of sequences
Only top num_words-1
most frequent words will be taken into account. Only words known by the tokenizer will be taken into account.
Arguments | |
---|---|
sequences | A list of sequences. |
Yields | |
---|---|
Yields individual texts. |
texts_to_matrix
texts_to_matrix( texts, mode='binary' )
Convert a list of texts to a Numpy matrix.
Arguments | |
---|---|
texts | list of strings. |
mode | one of "binary", "count", "tfidf", "freq". |
Returns | |
---|---|
A Numpy matrix. |
texts_to_sequences
texts_to_sequences( texts )
Transforms each text in texts to a sequence of integers.
Only top num_words-1
most frequent words will be taken into account. Only words known by the tokenizer will be taken into account.
Arguments | |
---|---|
texts | A list of texts (strings). |
Returns | |
---|---|
A list of sequences. |
texts_to_sequences_generator
texts_to_sequences_generator( texts )
Transforms each text in texts
to a sequence of integers.
Each item in texts can also be a list, in which case we assume each item of that list to be a token.
Only top num_words-1
most frequent words will be taken into account. Only words known by the tokenizer will be taken into account.
Arguments | |
---|---|
texts | A list of texts (strings). |
Yields | |
---|---|
Yields individual sequences. |
to_json
to_json( **kwargs )
Returns a JSON string containing the tokenizer configuration. To load a tokenizer from a JSON string, use keras.preprocessing.text.tokenizer_from_json(json_string)
.
Arguments | |
---|---|
**kwargs | Additional keyword arguments to be passed to json.dumps() . |
Returns | |
---|---|
A JSON string containing the tokenizer configuration. |
© 2020 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/versions/r2.4/api_docs/python/tf/keras/preprocessing/text/Tokenizer