## What is the difference between a hashing vectorizer and a tfidf vectorizer?


I'm converting a corpus of text documents into word vectors for each document. I've tried this using a TfidfVectorizer and a HashingVectorizer.

I understand that a HashingVectorizer does not take IDF scores into account the way a TfidfVectorizer does. The reason I'm still working with a HashingVectorizer is the flexibility it gives while dealing with huge datasets, as explained here and here. (My original dataset has 30 million documents.)

Currently, I am working with a sample of 45339 documents, so I can use a TfidfVectorizer as well. When I run both vectorizers on the same 45339 documents, the matrices I get have different shapes.

from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer()
with LSM('corpus.db') as corpus:
    hashing_matrix = hashing.fit_transform(corpus)
print('hashing matrix shape', hashing_matrix.shape)


hashing matrix shape (45339, 1048576)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
with LSM('corpus.db') as corpus:
    tfidf_matrix = tfidf.fit_transform(corpus)
print('tfidf matrix shape', tfidf_matrix.shape)


tfidf matrix shape (45339, 663307)

I want to better understand the differences between a HashingVectorizer and a TfidfVectorizer, and the reason why these matrices have different sizes - particularly in the number of words/terms.



The main difference is that HashingVectorizer applies a hashing function to term frequency counts in each document, whereas TfidfVectorizer scales those term frequency counts in each document by penalising terms that appear more widely across the corpus. There’s a great summary here.

• Hash functions are an efficient way of mapping terms to features; hashing doesn’t necessarily need to be applied only to term frequencies, but that’s how HashingVectorizer employs it here. With your 45339 documents, I suspect the feature vectors are of length 1048576 because that’s the default n_features of 2^20; you could reduce this and make the matrix less expensive to process, but with an increased risk of collisions, where the function maps different terms to the same feature (there’s a small illustrative sketch of this mapping after this list).

• Depending on the use case for the word vectors, it may be possible to reduce the length of the hash feature vector (and thus complexity) significantly, with acceptable loss of accuracy/effectiveness (due to increased collisions). Scikit-learn has some hashing parameters that can assist, for example alternate_sign.

• If the hashing matrix is wider than the dictionary, many of the columns in the hashing matrix will be empty - not just because a given document doesn't contain a specific term, but because they're empty across the whole matrix. If it is narrower than the dictionary, the hashing function will map multiple terms to the same feature - this is the 'collision' mentioned above. HashingVectorizer has a setting that works to mitigate this called alternate_sign, which is on by default and is described here.

• ‘Term frequency - inverse document frequency’ takes the term frequencies in each document and weights them by penalising words that appear frequently across the whole corpus. The intuition is that terms found only in particular documents are more likely to be representative of those documents’ topics. This differs from hashing in that a full dictionary of the words in the corpus is needed in order to calculate each inverse document frequency. I expect your tf.idf matrix dimensions are 45339 documents by 663307 unique words in the corpus; Manning et al provide more detail and worked examples of the calculation (a short sketch of the idf weighting also follows below).
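To make the hashing idea concrete, here is a minimal sketch of the 'hashing trick' on a made-up two-document corpus. This is an illustration only, not scikit-learn's internal implementation (the library uses a signed MurmurHash3 rather than Python's built-in hash, which is salted per process), and the helper name and toy documents are invented for the example.

import numpy as np

def hashed_counts(docs, n_features=8):
    # Toy hashing trick: each token is mapped straight to a column index,
    # so no vocabulary/dictionary is ever built or stored.
    X = np.zeros((len(docs), n_features), dtype=int)
    for row, doc in enumerate(docs):
        for token in doc.lower().split():
            col = hash(token) % n_features  # different tokens can collide on the same column
            X[row, col] += 1
    return X

docs = ["the cat sat on the mat", "the dog chased the cat"]
print(hashed_counts(docs))  # shape (2, 8): columns are hash buckets, not specific words

Making n_features smaller shrinks the matrix but squeezes more distinct tokens into the same buckets, which is exactly the memory-vs-collision trade-off described above.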

‘Mining of Massive Datasets’ by Leskovec et al has a ton of detail on both feature hashing and tf.idf; the authors made the pdf available here.
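To make the idf weighting concrete, here is a minimal sketch of the calculation TfidfVectorizer performs with its default settings (smooth_idf=True, l2 normalisation), again on a made-up toy corpus; it should reproduce the library's own output up to floating-point error.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat on the mat"]

# Raw term counts and document frequencies - note this needs the full
# vocabulary, unlike the hashing approach sketched above
counts = CountVectorizer().fit_transform(docs).toarray()
df = (counts > 0).sum(axis=0)   # number of documents containing each term
n = counts.shape[0]

# Smoothed idf, then weight the counts and l2-normalise each row
idf = np.log((1 + n) / (1 + df)) + 1
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.round(tfidf, 3))
print(np.round(TfidfVectorizer().fit_transform(docs).toarray(), 3))

Terms that occur in every document (like "the" here) get the smallest idf weight, which is the penalisation described in the bullet above.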

If the tfidf vectorizer needs a full dictionary of words for idf calculations, shouldn't the terms in the tfidf matrix be more than the terms in the hashing matrix? – Minu – 2017-08-15T15:49:46.340

If the hashing matrix is wider than the dictionary, it will mean that many of the column entries in the hashing matrix will be empty, and not just because a given document doesn't contain a specific term but because they're empty across the whole matrix. Slightly off-topic, but are you doing any processing to the words in your documents before vectorising? Stopwords, stemming, etc? – redhqs – 2017-08-15T19:24:28.347

Yes, I'm preprocessing. I'm using spaCy. – Minu – 2017-08-15T19:43:09.273

Confirmation: So, 1048576 is the default length of any hashing matrix if n_features is not specified? If there really are only 663307 words in the corpus, the remaining 385269 features are empty. How can I make this hashing matrix snug, without all the empty features? – Minu – 2017-08-15T19:47:30.530

That's right - you can resize the number of features by changing the parameter n_features (default 1048576); if you have time, try 640k and 320k and see whether it has much of an impact on your accuracy. It should speed up your training time at least. See @Nathan's answer for n_features=5! – redhqs – 2017-08-15T20:18:45.277

If I limit the number of features n_features to a smaller number, say 50000, will the hashing function pick the most important 50000 words? Is there an algorithmic mechanism for picking the 50000 terms? Also, I don't see any parameter that allows passing a list of 50000 words as features to the HashingVectorizer. Is that right? – Minu – 2017-08-16T18:15:59.967

There's no selection on importance, but it might send multiple terms to the same feature hash - this is the 'collision' we've been talking about. HashingVectorizer has a setting that works to mitigate this called alternate_sign that's on by default, described here: https://en.wikipedia.org/wiki/Feature_hashing#Properties – redhqs – 2017-08-16T20:17:54.773

Could you include your first comment (the one I upvoted) in your answer, so it covers that part of my question and I can accept it? "If the hashing matrix is wider than the dictionary, it will mean that many of the column entries in the hashing matrix will be empty, and not just because a given document doesn't contain a specific term but because they're empty across the whole matrix." – Minu – 2017-08-18T14:18:30.590

Done, hope the vectorising is working out ok! – redhqs – 2017-08-18T15:13:11.357


The HashingVectorizer has a parameter n_features, which is 1048576 by default. When hashing, it doesn't actually compute a dictionary mapping terms to a unique index. Instead, each term is hashed into a table large enough that you don't expect too many collisions: hash(term) mod table_size. You can make the returned matrix any size you want by setting n_features. You should adjust this to be in the right ballpark for your corpus if you don't think the default is reasonable (making it larger causes fewer collisions, though it takes more memory).

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer()
print(vectorizer.transform(['a very small document']).shape)
# (1, 1048576)

small_vectorizer = HashingVectorizer(n_features=5)
print(small_vectorizer.transform(['a very small document']).shape)
# (1, 5)



HashingVectorizer and CountVectorizer (note: not TfidfVectorizer) are meant to do the same thing: convert a collection of text documents to a matrix of token occurrences.

If you are looking for term frequencies weighted by their relative importance (IDF), then TfidfVectorizer is what you should use. If you need the raw counts or normalized counts (term frequency), then you should use CountVectorizer or HashingVectorizer.
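As a quick illustration of that split (the tiny corpus below is made up for the example): CountVectorizer builds an explicit vocabulary and returns raw counts, HashingVectorizer returns hashed (and by default normalised) counts with a width fixed by n_features, and TfidfVectorizer additionally applies the idf weighting.

from sklearn.feature_extraction.text import (CountVectorizer,
                                              HashingVectorizer,
                                              TfidfVectorizer)

docs = ["the cat sat", "the dog sat on the mat"]

count_X = CountVectorizer().fit_transform(docs)                   # raw counts, explicit vocabulary
hash_X = HashingVectorizer(n_features=2**10).fit_transform(docs)  # hashed counts, no vocabulary stored
tfidf_X = TfidfVectorizer().fit_transform(docs)                   # counts reweighted by idf and l2-normalised

print(count_X.shape, hash_X.shape, tfidf_X.shape)
# e.g. (2, 6) (2, 1024) (2, 6) - the hashed width is whatever n_features is set to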