Tokenizer
Defined in tensorflow/python/keras/_impl/keras/preprocessing/text.py.
Text tokenization utility class.
This class allows you to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.
Arguments:
num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words words will be kept.
filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
lower: boolean. Whether to convert the texts to lowercase.
split: character or string to use for token splitting.
char_level: if True, every character will be treated as a token.
oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens. They will then be indexed or vectorized.
0 is a reserved index that won't be assigned to any word.
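Below is a minimal usage sketch (assuming the class is accessed as tf.keras.preprocessing.text.Tokenizer, per the URL at the end of this page; the corpus and the values in the comments are illustrative):

```python
import tensorflow as tf

# Fit the tokenizer on a small corpus, then vectorize texts either as
# integer sequences or as one row per document in a Numpy matrix.
corpus = [
    'The cat sat on the mat.',
    'The dog ate my homework.',
]

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100)
tokenizer.fit_on_texts(corpus)

print(tokenizer.word_index)                           # word -> integer index (0 is reserved)
print(tokenizer.texts_to_sequences(['the cat ate']))  # e.g. [[1, 2, 7]]
print(tokenizer.texts_to_matrix(corpus, mode='binary').shape)  # (2, 100)
```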
__init__
__init__( num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, **kwargs )
Initialize self. See help(type(self)) for accurate signature.
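For illustration, a sketch of some non-default constructor arguments (the '<unk>' token value is a hypothetical choice, and the exact index assigned to the oov_token varies between Keras versions):

```python
import tensorflow as tf

# char_level=True indexes individual characters instead of words.
char_tok = tf.keras.preprocessing.text.Tokenizer(char_level=True)
char_tok.fit_on_texts(['abca'])
print(char_tok.texts_to_sequences(['cab']))  # indices of 'c', 'a', 'b'

# oov_token is added to word_index; out-of-vocabulary words are then mapped
# to its index instead of being silently dropped.
word_tok = tf.keras.preprocessing.text.Tokenizer(num_words=10, oov_token='<unk>')
word_tok.fit_on_texts(['a b c'])
print(word_tok.word_index)
print(word_tok.texts_to_sequences(['a z']))  # 'z' maps to the oov_token index
```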
fit_on_sequences
fit_on_sequences(sequences)
Updates internal vocabulary based on a list of sequences.
Required before using sequences_to_matrix (if fit_on_texts was never called).
Arguments:
sequences: A list of sequences. A "sequence" is a list of integer word indices.
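A short sketch of fitting on pre-indexed sequences; this collects the document statistics that the "tfidf" mode of sequences_to_matrix relies on (values here are illustrative):

```python
import tensorflow as tf

# The input is already a list of integer sequences, so no raw texts are needed.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10)
tokenizer.fit_on_sequences([[1, 2, 3], [2, 3, 4, 4]])

# "tfidf" mode uses the per-index document counts gathered above.
matrix = tokenizer.sequences_to_matrix([[1, 2, 2], [3]], mode='tfidf')
print(matrix.shape)  # (2, 10)
```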
fit_on_texts
fit_on_texts(texts)
Updates internal vocabulary based on a list of texts.
In the case where texts contains lists, we assume each entry of the lists to be a token.
Required before using texts_to_sequences or texts_to_matrix.
Arguments:
texts: can be a list of strings, a generator of strings (for memory efficiency), or a list of lists of strings.
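A brief sketch of what fitting populates (attribute contents in the comments are indicative):

```python
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['the quick brown fox', 'the lazy dog', 'the fox'])

print(tokenizer.document_count)  # 3
print(tokenizer.word_counts)     # per-word occurrence counts across all texts
print(tokenizer.word_index)      # e.g. {'the': 1, 'fox': 2, ...}, ordered by frequency
```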
sequences_to_matrix
sequences_to_matrix( sequences, mode='binary' )
Converts a list of sequences into a Numpy matrix.
Arguments:
sequences: list of sequences (a sequence is a list of integer word indices).
mode: one of "binary", "count", "tfidf", "freq".
Returns:
A Numpy matrix.
Raises:
ValueError: in case of an invalid mode argument, or if the Tokenizer needs to be fit on sample data first.
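For illustration, a sketch using "count" mode, where no prior fitting is needed as long as num_words was set at construction time:

```python
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=5)

# Column j of row i holds how often index j occurs in sequence i.
matrix = tokenizer.sequences_to_matrix([[1, 2, 2], [3, 4]], mode='count')
print(matrix)
# e.g. [[0. 1. 2. 0. 0.]
#       [0. 0. 0. 1. 1.]]
```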
texts_to_matrix
texts_to_matrix( texts, mode='binary' )
Converts a list of texts to a Numpy matrix.
Arguments:
texts: list of strings.
mode: one of "binary", "count", "tfidf", "freq".
Returns:
A Numpy matrix.
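A sketch of the shortcut from raw texts to a matrix, which behaves like texts_to_sequences followed by sequences_to_matrix:

```python
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['hello world', 'hello tensorflow'])

matrix = tokenizer.texts_to_matrix(['hello world world'], mode='count')
print(matrix.shape)  # (1, 4): len(word_index) + 1 columns when num_words is unset
```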
texts_to_sequences
texts_to_sequences(texts)
Transforms each text in texts into a sequence of integers.
Only the top num_words most frequent words will be taken into account. Only words known by the tokenizer will be taken into account.
Arguments:
texts: A list of texts (strings).
Returns:
A list of sequences.
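A small sketch; the exact indices depend on word frequencies in the fitted corpus:

```python
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['to be or not to be'])

# Unknown words are dropped because no oov_token was configured.
print(tokenizer.texts_to_sequences(['to be or not to be elsewhere']))
# e.g. [[1, 2, 3, 4, 1, 2]]
```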
texts_to_sequences_generator
texts_to_sequences_generator(texts)
Transforms each text in texts into a sequence of integers.
Each item in texts can also be a list, in which case we assume each item of that list to be a token.
Only the top num_words most frequent words will be taken into account. Only words known by the tokenizer will be taken into account.
Arguments:
texts: A list of texts (strings).
Yields:
Individual sequences (one sequence of word indices per input text).
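A sketch of the generator variant, which yields one sequence at a time instead of building the whole list in memory:

```python
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['a b c', 'b c d'])

for seq in tokenizer.texts_to_sequences_generator(['a b', 'c d']):
    print(seq)  # one list of word indices per input text
```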
© 2018 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer