## How can word2vec handle unseen / new words in classification?

7

3

In simple terms: if my classifier uses word2vec vectors as features, what am I supposed to do when a new word arrives that has no word2vec vector?

I am trying to use word2vec or other word vectors for entity-based classification.

For example:

I have to classify the words in the following sentence:

"Google gives information about Nigeria"


Here, I would like to classify Nigeria as a location.

Suppose I have good word2vec vectors for each of the words. From some reading, I understand that recurrent neural networks can be used for this, and that word2vec will give most locations similar word vectors.

But my questions are:

a) Suppose a new location appears, let's say Russia. Do I need to assign a new word vector for this location?

b) Suppose my training input does not make grammatical sense. For example:

"Google information Nigeria", where every word other than Nigeria carries a non-location label. Will this setup work for finding new locations in non-grammatical sentences?

3

One way to do this is to represent each word with context information in addition to its w2v vector. You can encode the context in any fixed-length form you like: add another 600 dimensions (the 100D w2v vectors of the 3 left and 3 right context words), add just another 100 dimensions holding the sum of the context vectors, or use any other fixed-length representation of the context.
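As a rough illustration of the two context encodings mentioned above, here is a minimal sketch. The function names (`lookup`, `context_features`) are made up for this example, the vectors are random stand-ins rather than trained word2vec vectors, and using zeros for padding and for unseen words is an assumption, not something the answer prescribes.

```python
import numpy as np

DIM = 100     # assumed w2v dimensionality, as in the answer
WINDOW = 3    # 3 left and 3 right context words

def lookup(vectors, word):
    """Return the word's vector, or zeros if the word is out of vocabulary."""
    return vectors.get(word, np.zeros(DIM))

def context_features(tokens, i, vectors, mode="concat"):
    """Features for tokens[i]: its own vector followed by a fixed-length context part."""
    positions = [j for j in range(i - WINDOW, i + WINDOW + 1) if j != i]
    # Positions outside the sentence are padded with zero vectors.
    context = [
        lookup(vectors, tokens[j]) if 0 <= j < len(tokens) else np.zeros(DIM)
        for j in positions
    ]
    if mode == "concat":
        context_part = np.concatenate(context)    # 6 * 100 = 600 dimensions
    else:
        context_part = np.sum(context, axis=0)    # 100 dimensions
    return np.concatenate([lookup(vectors, tokens[i]), context_part])

# Toy stand-in vectors (random, NOT trained word2vec).
rng = np.random.default_rng(0)
tokens = "Google gives information about Nigeria".split()
vectors = {w: rng.normal(size=DIM) for w in tokens}

features_concat = context_features(tokens, 4, vectors, mode="concat")
features_sum = context_features(tokens, 4, vectors, mode="sum")

# An unseen word gets a zero word part but still has a non-zero context part.
unseen_tokens = "Google gives information about Russia".split()
features_unseen = context_features(unseen_tokens, 4, vectors, mode="sum")
```

With the concatenated encoding each word is described by 700 numbers (100 for the word plus 600 for the context); with the summed encoding, by 200.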

When you're training, you can use a version of 'word dropout' that makes use of this information: 20% of the time, set the word's own w2v vector to zero, forcing your classifier to rely on the context dimensions to represent the word.

When you encounter a new word, the hope is that the classifier has learned to use the context information as well as it has learned to use the w2v information.
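The word-dropout step could look roughly like the sketch below. It assumes each training row is laid out as the word's own vector (first 100 columns) followed by the context dimensions. The function name `word_dropout` and the toy batch are illustrative only.

```python
import numpy as np

def word_dropout(features, word_dim=100, rate=0.2, rng=None):
    """Zero the word-vector slice of a random subset of rows; keep the context slice."""
    rng = np.random.default_rng() if rng is None else rng
    features = features.copy()
    drop = rng.random(features.shape[0]) < rate   # True for roughly `rate` of the rows
    features[drop, :word_dim] = 0.0
    return features, drop

rng = np.random.default_rng(0)
# Toy batch: 100D word part + 100D summed-context part per row (random data).
batch = rng.normal(size=(1000, 200))
dropped, mask = word_dropout(batch, rng=rng)
```

This would be applied to training batches only, ideally with a fresh mask each epoch. At prediction time, an unseen word is then fed in with a zero word part, which is a situation the classifier has already seen during training.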

Interesting... do you have a reference that uses this technique? – Fred – 2016-03-18T15:08:08.517

BERT is similar, @Fred – Learning stats by example – 2020-06-30T23:00:44.983

0

> Suppose a new location is there, let's say Russia. So, do I need to assign a new word vector for this location?

Define a single unknown-word vector that represents every word that is not in your vocabulary.
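A minimal sketch of that idea follows. How the unknown-word vector is initialized is left open by the answer. Using the mean of the known vectors here is an assumption; a zero vector, or a vector trained by mapping rare training words to the unknown token, are other common choices. The vocabulary below is random toy data, not trained word2vec.

```python
import numpy as np

DIM = 100
rng = np.random.default_rng(0)

# Toy vocabulary (random stand-ins for trained vectors).
vocab = {
    w: rng.normal(size=DIM)
    for w in ["Google", "gives", "information", "about", "Nigeria"]
}

# One shared vector for every out-of-vocabulary word.
# Assumption: initialized as the mean of all known vectors.
unk_vector = np.mean(list(vocab.values()), axis=0)

def embed(word):
    """Return the word's vector, falling back to the shared unknown-word vector."""
    return vocab.get(word, unk_vector)

sentence = "Google gives information about Russia".split()
matrix = np.stack([embed(w) for w in sentence])   # one row per token
```

Here "Russia" is not in the vocabulary, so its row in `matrix` is the unknown-word vector, and the classifier has to rely on the surrounding words to label it.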