I understand that a HashingVectorizer does not take IDF scores into consideration the way a TfidfVectorizer does. The reason I'm still working with a HashingVectorizer is the flexibility it gives when dealing with huge datasets, as explained here and here. (My original dataset has 30 million documents.)
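For reference, here is a minimal sketch of why the hashing approach scales: HashingVectorizer is stateless, so batches of documents can be transformed independently without ever holding a vocabulary in memory. The batch reader below is just a hypothetical stand-in for chunked reading of a large corpus:

    from sklearn.feature_extraction.text import HashingVectorizer

    hashing = HashingVectorizer()  # no state is learned during fit

    def iter_batches():
        # Hypothetical stand-in for reading millions of documents in chunks.
        yield ["first document ...", "second document ..."]
        yield ["third document ...", "fourth document ..."]

    # Each batch is transformed independently; every partial matrix has the
    # same fixed width (n_features columns), so the pieces can be stacked or
    # fed incrementally to an out-of-core learner.
    partial_matrices = [hashing.transform(batch) for batch in iter_batches()]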
Currently, I am working with a sample of 45339 documents, so I can also use a TfidfVectorizer. When I use these two vectorizers on the same 45339 documents, the matrices I get are different.
    from lsm import LSM  # assuming the lsm-db package provides LSM
    from sklearn.feature_extraction.text import HashingVectorizer

    hashing = HashingVectorizer()
    with LSM('corpus.db') as corpus:
        hashing_matrix = hashing.fit_transform(corpus)
    print(hashing_matrix.shape)
    # hashing matrix shape: (45339, 1048576)
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    with LSM('corpus.db') as corpus:
        tfidf_matrix = tfidf.fit_transform(corpus)
    print(tfidf_matrix.shape)
    # tfidf matrix shape: (45339, 663307)
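If I read the docs correctly, the two widths come from different places: the hashing matrix width is the fixed n_features parameter (default 2**20 = 1048576), while the tf-idf matrix width is the size of the vocabulary learned from the corpus. A quick check on the fitted objects above (a sketch reusing hashing and tfidf from the snippets):

    # Width of the hashing matrix: fixed ahead of time, independent of data.
    print(hashing.n_features)      # 1048576 == 2 ** 20, the default

    # Width of the tf-idf matrix: number of distinct terms actually found.
    print(len(tfidf.vocabulary_))  # 663307 for the 45339-document sample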
I want to better understand the differences between a HashingVectorizer and a TfidfVectorizer, and the reason why these matrices have different sizes, particularly in the number of words/terms.
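To make the question concrete, here is a tiny self-contained example that reproduces the same mismatch on a toy corpus (independent of my data):

    from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat ran"]

    # Hash-based: the column count is the preset hash-space size, not the data.
    h = HashingVectorizer(n_features=16)
    print(h.fit_transform(docs).shape)  # (3, 16)

    # Vocabulary-based: the column count equals the number of distinct terms.
    t = TfidfVectorizer()
    print(t.fit_transform(docs).shape)  # (3, 5)
    print(sorted(t.vocabulary_))        # ['cat', 'dog', 'ran', 'sat', 'the']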