How word2vec can be used to identify unseen words and relate them to already trained data

14

10

I was working with the gensim word2vec model and found it really interesting. I am interested in whether an unknown/unseen word, when checked against the model, can retrieve similar terms from the trained model.

Is this possible? Can word2vec be tweaked for this? Or does the training corpus need to contain every word for which I want to find similar terms?

gaurus

Posted 2015-12-26T03:47:48.800

Reputation: 321

You may want to try FastText as it can handle unseen words. https://datascience.stackexchange.com/questions/54806/word-embedding-of-a-new-word-which-was-not-in-training

– zfact0rial – 2021-02-18T21:34:22.980

Answers

9

Every algorithm that deals with text data has a vocabulary. In the case of word2vec, the vocabulary is comprised of all words in the input corpus, or at least those above the minimum-frequency threshold.
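As a minimal sketch of that idea (not gensim's actual implementation), a vocabulary with a minimum-frequency cutoff could be built as below. The function name `build_vocab` and the toy corpus are made up for illustration; the `min_count` argument only mirrors the name of the corresponding gensim parameter.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Return the set of tokens that occur at least `min_count` times."""
    # Count every token across all tokenized sentences.
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}


# Toy corpus, already tokenized (illustrative data only).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "cat", "ran"],
]

vocab = build_vocab(corpus, min_count=2)
print(sorted(vocab))   # ['cat', 'sat', 'the']
print("dog" in vocab)  # False: seen once, so below the threshold
```

Note that "dog" does appear in the corpus but is still out of vocabulary, because it falls below the threshold.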

Algorithms tend to ignore words that are outside their vocabulary. However, there are ways to reframe your problem so that there are essentially no out-of-vocabulary (OOV) words.

Remember that words are simply "tokens" in word2vec. They could be ngrams or they could be letters. One way to define your vocabulary is to say that every word that occurs at least X times is in your vocabulary. Then the most common "syllables" (ngrams of letters) are added to your vocabulary. Then you add individual letters to your vocabulary.

In this way you can define any word as either

  1. A word in your vocabulary
  2. A set of syllables in your vocabulary
  3. A combined set of letters and syllables in your vocabulary
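
The three cases above can be sketched as a greedy fallback tokenizer. This is only one possible way to do it, under some assumptions: the function name `tokenize_with_fallback` and the toy vocabularies are hypothetical, "syllables" are approximated by letter n-grams of length 2 to `max_n`, and every single letter is assumed to be in the vocabulary.

```python
def tokenize_with_fallback(word, word_vocab, ngram_vocab, max_n=3):
    """Split `word` into tokens that are all in the vocabulary."""
    # Case 1: the whole word is known.
    if word in word_vocab:
        return [word]

    tokens = []
    i = 0
    while i < len(word):
        # Cases 2 and 3: try the longest known n-gram starting at position i.
        for n in range(min(max_n, len(word) - i), 1, -1):
            piece = word[i:i + n]
            if piece in ngram_vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # No known n-gram starts here: fall back to a single letter.
            tokens.append(word[i])
            i += 1
    return tokens


# Toy vocabularies (illustrative data only).
word_vocab = {"train", "model"}
ngram_vocab = {"un", "see", "ing", "tr"}

print(tokenize_with_fallback("model", word_vocab, ngram_vocab))   # ['model']
print(tokenize_with_fallback("unseen", word_vocab, ngram_vocab))  # ['un', 'see', 'n']
```

The corpus would be tokenized the same way before training, so that the n-grams and letters get vectors of their own.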

jamesmf

Posted 2015-12-26T03:47:48.800

Reputation: 2 927

3

word2vec treats words as atoms. To get meaningful vectors for unknown words, you either have to

  • change what these atoms are, e.g. switch to letter n-grams as in jamesmf's answer, or
  • use a different model that explicitly looks at what is inside your words, e.g. the CWE model, which is easy to use.
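
To illustrate what "looking inside a word" can mean, here is a rough sketch that builds a vector for an unseen word from the vectors of its character n-grams. It is a simplification in the spirit of subword models, not the CWE algorithm or any library's exact method. The helper names, the `<`/`>` boundary markers, the averaging step, and the toy 2-dimensional vectors are all assumptions made for the example.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Return all character n-grams of the word, with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known n-grams (None if none are known)."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return None
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Toy n-gram vectors (made-up values, for illustration only).
ngram_vectors = {"<ca": [1.0, 0.0], "at>": [0.0, 1.0]}

print(char_ngrams("cat", 3, 3))              # ['<ca', 'cat', 'at>']
print(oov_vector("cat", ngram_vectors, 2))   # [0.5, 0.5]
print(oov_vector("dog", ngram_vectors, 2))   # None
```

In a real model the n-gram vectors would be learned during training rather than written by hand.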

Joachim Wagner

Posted 2015-12-26T03:47:48.800

Reputation: 221

https://github.com/facebookresearch/fastText seems to work well – Joachim Wagner – 2017-01-13T16:02:40.650

Yeah, I tried that, but it doesn't work well for tasks like morphological segmentation. – gaurus – 2017-02-20T13:10:43.830

3

The training corpus needs to contain every word for which you want to find similar words.

Franck Dernoncourt

Posted 2015-12-26T03:47:48.800

Reputation: 4 975

0

Both word2vec and FastText fail if the word is not in the vocabulary: the lookup throws an error. The model returns a list of scores for related words, but an unseen word will not be in the vocabulary, will it? So how does this solve the unseen-word problem?

Sam

Posted 2015-12-26T03:47:48.800

Reputation: 1