Word2Vec embeddings with TF-IDF



When you train a word2vec model (using, for instance, gensim), you supply a list of sentences, each a list of words. But there does not seem to be a way to specify weights for the words, calculated, for instance, using TF-IDF.

Is the usual practice to multiply the word vector embeddings with the associated TF-IDF weight? Or can word2vec organically take advantage of these somehow?


Posted 2018-03-04T12:07:33.313




Word2Vec algorithms (Skip-gram and CBOW) treat each word equally, because their goal is to compute word embeddings. The distinction becomes important when one needs to work with sentence or document embeddings: not all words contribute equally to the meaning of a particular sentence. Here different weighting strategies are applied, TF-IDF being one of them, and, according to some papers, a fairly successful one. From this question on StackOverflow:

In this work, tweets were modeled using three types of text representation. The first one is a bag-of-words model weighted by tf-idf (term frequency - inverse document frequency) (Section 2.1.1). The second represents a sentence by averaging the word embeddings of all words (in the sentence) and the third represents a sentence by averaging the weighted word embeddings of all words, the weight of a word is given by tf-idf (Section 2.1.2).




Train a TfidfVectorizer (from scikit-learn) on your corpus and use the following code:

    tfidf = TfidfVectorizer()
    tfidf.fit(corpus)  # corpus: list of raw text documents
    # On scikit-learn >= 1.0, use tfidf.get_feature_names_out() instead
    weights = dict(zip(tfidf.get_feature_names(), tfidf.idf_))

Now you have a dictionary with words as its keys and weights as the corresponding values.
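To connect this back to the question, one way to use such a dictionary is to scale each word vector by its weight before averaging. A minimal sketch, with stand-in values in place of a real trained model and the fitted vectorizer:

```python
import numpy as np

# Stand-ins: in practice, `weights` is the idf dictionary built above
# and `vectors` comes from a trained word2vec model (e.g. model.wv).
weights = {"the": 1.0, "cat": 1.4, "sat": 1.4}
vectors = {w: np.random.rand(5) for w in weights}

sentence = ["the", "cat", "sat"]
weighted = [vectors[w] * weights[w] for w in sentence]
# Weighted average: divide the sum by the total weight, not the word count.
sentence_vec = np.sum(weighted, axis=0) / sum(weights[w] for w in sentence)
```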

Let me know if it worked.

Aayush Shrivastav


Yes, it does. Thanks for your help. – Tanveer – 2020-02-06T12:46:01.843