class sklearn.feature_extraction.text.HashingVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, non_negative=False, dtype=<class 'numpy.float64'>)
Convert a collection of text documents to a matrix of token occurrences.
It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected on the Euclidean unit sphere if norm='l2'.
This text vectorizer implementation uses the hashing trick to map token strings to feature integer indices, rather than building an in-memory vocabulary.
This strategy has several advantages:

- it is very low memory and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory
- it is fast to pickle and un-pickle, as it holds no state besides the constructor parameters
- it can be used in a streaming (partial fit) or parallel pipeline, as there is no state computed during fit

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

- there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model
- there can be collisions: distinct tokens can be mapped to the same feature index; in practice this is rarely an issue if n_features is large enough (e.g. 2**18 for text classification problems)
- no IDF weighting, as this would render the transformer stateful
The hash function employed is the signed 32-bit version of Murmurhash3.
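A rough sketch of that mapping (illustrative only; the library's internal code path differs in its details) hashes each token with the murmurhash3_32 helper from sklearn.utils and folds the result into the [0, n_features) range, with the sign of the hash driving the alternate_sign option:

>>> from sklearn.utils import murmurhash3_32
>>> h = murmurhash3_32('document', positive=False)  # signed 32-bit MurmurHash3 of the token
>>> column = abs(h) % (2 ** 4)                      # column index, here for n_features=2**4
>>> sign = 1 if h >= 0 else -1                      # sign applied when alternate_sign=True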
Read more in the User Guide.
Parameters:

input : string {'content', 'filename', 'file'}, default 'content'
    Whether the items of the input sequence are the documents themselves ('content'), filenames to read ('filename'), or file-like objects ('file').
encoding : string, default 'utf-8'
    Encoding used to decode byte input.
decode_error : {'strict', 'ignore', 'replace'}, default 'strict'
    What to do if a byte sequence contains characters not of the given encoding.
strip_accents : {'ascii', 'unicode', None}, default None
    Remove accents and perform other character normalization during preprocessing; None does nothing.
lowercase : boolean, default True
    Convert all characters to lowercase before tokenizing.
preprocessor : callable or None, default None
    Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
tokenizer : callable or None, default None
    Override the string tokenization step while preserving the preprocessing and n-grams generation steps; only applies if analyzer == 'word'.
stop_words : string {'english'}, list, or None, default None
    If 'english', a built-in stop word list is used; if a list, those tokens are removed; only applies if analyzer == 'word'.
token_pattern : string
    Regular expression denoting what constitutes a "token"; only used if analyzer == 'word'. The default matches tokens of two or more alphanumeric characters.
ngram_range : tuple (min_n, max_n), default (1, 1)
    Lower and upper boundary of the range of n-values for the n-grams to be extracted.
analyzer : string {'word', 'char', 'char_wb'} or callable, default 'word'
    Whether the features should be made of word or character n-grams; 'char_wb' creates character n-grams only from text inside word boundaries.
n_features : integer, default 2**20 (1048576)
    Number of features (columns) in the output matrices. Small numbers are likely to cause hash collisions, while large numbers give larger coefficient dimensions in linear learners.
binary : boolean, default False
    If True, all non-zero counts are set to 1; useful for discrete probabilistic models that model binary events rather than integer counts.
norm : 'l1', 'l2' or None, default 'l2'
    Norm used to normalize term vectors; None for no normalization.
alternate_sign : boolean, default True
    When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features.
non_negative : boolean, default False
    Deprecated. When True, an absolute value is applied to the feature matrix prior to returning it.
dtype : type, default numpy.float64
    Type of the matrix returned by fit_transform() or transform().
See also
CountVectorizer, TfidfVectorizer
Examples

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
(4, 16)
Methods

build_analyzer() | Return a callable that handles preprocessing and tokenization
build_preprocessor() | Return a function to preprocess the text before tokenization
build_tokenizer() | Return a function that splits a string into a sequence of tokens
decode(doc) | Decode the input into a string of unicode symbols
fit(X[, y]) | Does nothing: this transformer is stateless
fit_transform(X[, y]) | Transform a sequence of documents to a document-term matrix
get_params([deep]) | Get parameters for this estimator
get_stop_words() | Build or fetch the effective stop words list
partial_fit(X[, y]) | Does nothing: this transformer is stateless
set_params(**params) | Set the parameters of this estimator
transform(X) | Transform a sequence of documents to a document-term matrix
__init__(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, non_negative=False, dtype=<class 'numpy.float64'>)
build_analyzer()
Return a callable that handles preprocessing and tokenization.
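For instance, with word uni- and bigrams the returned callable can be applied directly to a string:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> analyze = HashingVectorizer(ngram_range=(1, 2)).build_analyzer()
>>> analyze("Bi-grams are cool!")
['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']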
build_preprocessor()
Return a function to preprocess the text before tokenization.
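A small illustration, assuming strip_accents='unicode' together with the default lowercase=True:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> preprocess = HashingVectorizer(strip_accents='unicode').build_preprocessor()
>>> preprocess("Çà et là")
'ca et la'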
build_tokenizer()
Return a function that splits a string into a sequence of tokens.
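For example, with the default token_pattern, which keeps tokens of two or more word characters (lowercasing happens in the preprocessing step, not here):

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> tokenize = HashingVectorizer().build_tokenizer()
>>> tokenize("Two or more letters per token")
['Two', 'or', 'more', 'letters', 'per', 'token']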
decode(doc)
Decode the input into a string of unicode symbols.
The decoding strategy depends on the vectorizer parameters.

Parameters:
doc : string
    The string to decode.
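A brief illustration with the default input='content' setting: byte strings are decoded with the configured encoding, while unicode strings pass through unchanged:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vectorizer = HashingVectorizer(encoding='utf-8')
>>> vectorizer.decode(b'caf\xc3\xa9')
'café'
>>> vectorizer.decode('already text')
'already text'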
fit(X, y=None)
Does nothing: this transformer is stateless.

Parameters:
X : iterable over raw text documents
    Ignored; the transformer computes no state from the data.
fit_transform(X, y=None)
Transform a sequence of documents to a document-term matrix.

Parameters:
X : iterable over raw text documents, length = n_samples
    Samples. Each sample must be a text document (either bytes or unicode strings, a file name, or a file object, depending on the constructor argument) which will be tokenized and hashed.
y : ignored

Returns:
X : scipy.sparse matrix, shape = (n_samples, n_features)
    Document-term matrix.
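Because no state is learned, fit_transform gives the same result as calling transform directly; a small sketch with an illustrative two-document corpus:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> docs = ['the quick brown fox', 'jumped over the lazy dog']
>>> vectorizer = HashingVectorizer(n_features=2**8)
>>> X = vectorizer.fit_transform(docs)   # same matrix as vectorizer.transform(docs)
>>> X.shape
(2, 256)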
get_params(deep=True)
Get parameters for this estimator.

Parameters:
deep : boolean, optional
    If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any
    Parameter names mapped to their values.
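For example, constructor arguments are reported back by name:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> params = HashingVectorizer(n_features=2**10, norm=None).get_params()
>>> params['n_features'], params['norm']
(1024, None)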
get_stop_words()
Build or fetch the effective stop words list.
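For example, the built-in English list is returned as a frozen set, and None is returned when no stop words are configured:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> stop_words = HashingVectorizer(stop_words='english').get_stop_words()
>>> 'the' in stop_words
True
>>> HashingVectorizer().get_stop_words() is None
True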
partial_fit(X, y=None)
Does nothing: this transformer is stateless.
This method is just there to mark the fact that this transformer can work in a streaming setup.

Parameters:
X : iterable over raw text documents
    Ignored; the transformer computes no state from the data.
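A minimal out-of-core sketch (the batches and the SGDClassifier are illustrative choices, not part of this class): since the vectorizer holds no state, each batch can be hashed on the fly and fed to an estimator that itself supports partial_fit:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> vectorizer = HashingVectorizer(n_features=2**18)
>>> clf = SGDClassifier()
>>> batches = [
...     (['good movie', 'great plot'], [1, 1]),
...     (['terrible acting', 'boring film'], [0, 0]),
... ]
>>> for texts, labels in batches:
...     X = vectorizer.transform(texts)              # vocabulary-free, so no fit is needed
...     _ = clf.partial_fit(X, labels, classes=[0, 1])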
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.

Returns:
self
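For example, inside a Pipeline the vectorizer's parameters are addressed with a step-name prefix (the step name 'vect' here is an illustrative choice):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> pipe = Pipeline([('vect', HashingVectorizer()), ('clf', SGDClassifier())])
>>> pipe = pipe.set_params(vect__n_features=2**10, vect__ngram_range=(1, 2))
>>> pipe.get_params()['vect__n_features']
1024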
transform(X)
Transform a sequence of documents to a document-term matrix.

Parameters:
X : iterable over raw text documents, length = n_samples
    Samples. Each sample must be a text document (either bytes or unicode strings, a file name, or a file object, depending on the constructor argument) which will be tokenized and hashed.

Returns:
X : scipy.sparse matrix, shape = (n_samples, n_features)
    Document-term matrix.
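For example, previously unseen documents can be hashed directly, without any fitting step:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.transform(['a brand new, previously unseen document'])
>>> X.shape
(1, 16)
>>> X.format
'csr'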
© 2007–2018 The scikit-learn developers
Licensed under the 3-clause BSD License.
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html