Do we really need `<unk>` tokens?



I am wondering, do we really need `<unk>` tokens? Why do we limit our vocabulary?

Is it for speed? Accuracy?

If we disable all limitations, what do you predict happens?

A. Dandelion

Posted 2018-06-20T21:46:57.343

Reputation: 135

Please give a little more context. What are you using these tokens for? What do they replace? Any links to example usages or literature? – n1k31t4 – 2018-06-20T21:57:33.847

In sequence-to-sequence models a vocabulary is commonly used. For some reason, though, this vocabulary is limited, and any word which is not in the vocabulary gets replaced with `<unk>`. – A. Dandelion – 2018-06-20T21:59:38.383



The `<unk>` tag can simply be used to tell the model that there is content which is not semantically important to the output. This is a choice made via the selection of a dictionary: if a word is not in the dictionary we have chosen, we are saying we have no valid representation for that word (or we are simply not interested in it).
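As a minimal sketch of this idea (the vocabulary and function name here are hypothetical, for illustration only), out-of-vocabulary replacement might look like:

```python
# Minimal sketch: replace out-of-vocabulary tokens with <unk>.
# A real system would typically build the vocabulary from the most
# frequent words in the training corpus; this tiny set is illustrative.
vocabulary = {"the", "cat", "sat", "on", "mat"}

def replace_oov(tokens, vocab, unk="<unk>"):
    """Map any token not found in the vocabulary to the <unk> tag."""
    return [tok if tok in vocab else unk for tok in tokens]

print(replace_oov(["the", "aardvark", "sat", "on", "the", "mat"], vocabulary))
# ['the', '<unk>', 'sat', 'on', 'the', 'mat']
```

The model then only ever has to learn embeddings for the words in the chosen dictionary plus the single `<unk>` symbol.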

Other tags are commonly used to group such things together, not only (j)unk.

For example, `<EMOJI>` might replace any token that is found in our list of defined emojis. We keep some information, i.e. that there is a symbol representing emotion of some kind, but we discard exactly which emotion. You can think of many more examples where this might be helpful, or where you simply don't have the right (labelled) data to make full semantic use of the contents.
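The same replacement pattern works for grouping tags like this one; a quick sketch (the emoji list and function name are again illustrative assumptions):

```python
# Sketch: collapse any token from a known emoji list into one <EMOJI> tag.
# The set below is a tiny illustrative sample, not an exhaustive emoji list.
emojis = {"😀", "😢", ":)", ":("}

def tag_emojis(tokens, emoji_set, tag="<EMOJI>"):
    """Replace every emoji token with the generic <EMOJI> tag."""
    return [tag if tok in emoji_set else tok for tok in tokens]

print(tag_emojis(["great", "game", "😀"], emojis))
# ['great', 'game', '<EMOJI>']
```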


Posted 2018-06-20T21:46:57.343

Reputation: 12 573