## What is the difference between CountVectorizer token counts and TfidfTransformer with use_idf set to False?


We can use CountVectorizer to count the number of times a word occurs in a corpus:

# Tokenizing text
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# twenty_train is the 20 newsgroups training set used in the scikit-learn text tutorial
twenty_train = fetch_20newsgroups(subset='train')

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)


If we convert this to a data frame, we can see what the token counts look like: for example, the 35,780th word of the 3rd document occurs twice.
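A rough sketch of that conversion, assuming pandas is available (newer scikit-learn versions spell the vocabulary method get_feature_names_out()); only the first few rows are densified here, since the full matrix has thousands of rows and tens of thousands of columns:

import pandas as pd

# columns are the vocabulary terms, rows are documents
pd.DataFrame(X_train_counts[:5].toarray(), columns=count_vect.get_feature_names())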

We can use TfidfTransformer to compute the term frequency alone (without the inverse document frequency part) as follows:

from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)


Converting this to a data frame, we can see the representation is different: the TF for that same entry is now shown as 0.15523 instead of a raw count. Why is this different from the token count produced by CountVectorizer?

## Answers

14

Actually, the documentation was pretty clear. I'll keep it posted in case someone else searches before reading:

The TfidfTransformer transforms a count matrix to a normalized tf or tf-idf representation. So although both CountVectorizer and TfidfTransformer (with use_idf=False) produce term frequencies, TfidfTransformer normalizes the counts – by default each document's count vector is divided by its Euclidean (l2) norm, which is why the values come out as fractions like 0.15523 rather than whole numbers.
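A minimal sketch of that normalization on a single made-up count row (the numbers here are illustrative, not taken from the 20 newsgroups data):

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[2, 3, 2, 1, 0, 1]])                     # raw token counts for one document
tf = TfidfTransformer(use_idf=False).fit_transform(counts)  # norm='l2' is the default
tf.toarray()
>>> array([[0.45883147, 0.6882472 , 0.45883147, 0.22941573, 0.        , 0.22941573]])

# the same result by hand: each count divided by the Euclidean (l2) norm of the row
counts / np.linalg.norm(counts)
>>> array([[0.45883147, 0.6882472 , 0.45883147, 0.22941573, 0.        , 0.22941573]])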

9

Speaking only for myself, I find it much easier to work out these things using the simplest examples I can find, rather than the big monster texts that sklearn provides. The monster texts are useful later, but when you're figuring out the difference between CountVectorizer and TfidfVectorizer, the sheer volume of words isn't very helpful when you're trying to pick through the bones to see what's connected to what.

Consider, then, a case where we create two vectorizers, and two very simple examples:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

cv = CountVectorizer(stop_words='english')
tv = TfidfVectorizer(stop_words='english')

a = "cat hat bat splat cat bat hat mat cat"
b = "cat mat cat sat"


Calling fit_transform() on either vectorizer with our list of documents, [a,b], as the argument in each case, returns the same type of object – a 2x6 sparse matrix with 8 stored elements in Compressed Sparse Row format. The only difference is that the TfidfVectorizer() returns floats while the CountVectorizer() returns ints. And that’s to be expected – as explained in the documentation quoted above, TfidfVectorizer() assigns a score while CountVectorizer() counts.

cv_score = cv.fit_transform([a, b])
cv_score
>>> <2x6 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>

tv_score = tv.fit_transform([a, b])
tv_score
>>> <2x6 sparse matrix of type '<class 'numpy.float64'>'
with 8 stored elements in Compressed Sparse Row format>


The difference between our scoring matrices is easier to see if we iterate over them – hence the choice of nice short texts for our example. First, we'll write a function to convert our sparse matrices to good old lists.

def matrix_to_list(matrix):
    matrix = matrix.toarray()
    return matrix.tolist()

cv_score_list = matrix_to_list(cv_score)
cv_score_list
>>> [[2, 3, 2, 1, 0, 1], [0, 2, 0, 1, 1, 0]]

tv_score_list = matrix_to_list(tv_score)
tv_score_list
>>> [[0.5333344767907123,
0.5692078092660131,
0.5333344767907123,
0.18973593642200434,
0.0,
0.26666723839535617],
[0.0, 0.7572644142929534, 0.0, 0.3786322071464767, 0.5321543559503558, 0.0]]
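To see which column corresponds to which word, we can peek at the shared vocabulary (newer scikit-learn versions spell this method get_feature_names_out()):

cv.get_feature_names()
>>> ['bat', 'cat', 'hat', 'mat', 'sat', 'splat']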


Each list contains two lists as elements – the scores for a and b respectively in the case of tv_score_list, and the word counts for a and b respectively in the case of cv_score_list. We couple this with the knowledge that the (alphabetical) order of the words returned by .get_feature_names() on either vectorizer matches the order of the scoring. As such, we can iterate nicely over our four lists (2x2) and our word list to make a useful table:

print("tfidf_a  tfidf_b  count_a count_b   word")
print("-"*41)
for i in range(6):
print("  {:.3f}    {:.3f}        {:}       {:}   {:}".format(tv_score_list[0][i],
tv_score_list[1][i],
cv_score_list[0][i],
cv_score_list[1][i],
cv.get_feature_names()[i]))

tfidf_a  tfidf_b  count_a count_b   word
-----------------------------------------
0.533    0.000        2       0   bat
0.569    0.757        3       2   cat
0.533    0.000        2       0   hat
0.190    0.379        1       1   mat
0.000    0.532        0       1   sat
0.267    0.000        1       0   splat


Looking at the table allows us to get a feel for how the tf-idf algorithm works. For instance, splat appears once in a and not at all in b, while sat appears once in b and not at all in a – yet splat scores 0.267 while sat scores 0.532. Also notice that while mat appears in both documents, it scores 0.190 for a and 0.379 for b. Two effects are at work here: the inverse document frequency gives a lower weight to words that appear in both documents (cat, mat) than to words unique to one, and the per-document normalization spreads a's weight across more words than b's, so the same raw count ends up with a smaller score in the longer document.
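If you want to check the arithmetic, here is a rough sketch that reproduces the tf-idf columns of the table from the raw counts, using TfidfVectorizer's default settings (smooth_idf=True, norm='l2'):

import numpy as np

counts_a = np.array([2, 3, 2, 1, 0, 1])      # bat cat hat mat sat splat
counts_b = np.array([0, 2, 0, 1, 1, 0])
df = np.array([1, 2, 1, 2, 1, 1])            # how many documents each word appears in
n_docs = 2

idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse document frequency
tfidf_a = counts_a * idf
tfidf_a = tfidf_a / np.linalg.norm(tfidf_a)  # l2-normalize each document's vector
tfidf_b = counts_b * idf
tfidf_b = tfidf_b / np.linalg.norm(tfidf_b)

np.round(tfidf_a, 3)
>>> array([0.533, 0.569, 0.533, 0.19 , 0.   , 0.267])

np.round(tfidf_b, 3)
>>> array([0.   , 0.757, 0.   , 0.379, 0.532, 0.   ])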

Thanks for bearing with me. I hope this is of some help to someone.

2

From the TfidfTransformer documentation:

> Transform a count matrix to a normalized tf or tf-idf representation.
>
> Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.