Why is word prediction an obsession in Natural Language Processing?



I have heard how great BERT is at masked word prediction, i.e. predicting a missing word from a sentence.

In a Medium post about BERT, it says:

The basic task of a language model is to predict words in a blank, or it predicts the probability that a word will occur in that particular context. Let’s take another example:

“FC Barcelona is a _____ club”

Indeed, I recently heard about SpanBERT, which is "designed to better represent and predict spans of text".

What I do not understand is: why?

  1. I cannot think of any common reason that a human would need to do this task, let alone why it would need to be automated.
  2. This does not even seem to be a task where it is particularly easy to evaluate the success of a model. For example,

My ___ is cold

The blank could reasonably be filled by any of a number of words. How can BERT be expected to get this right, and how can humans or another algorithm be expected to evaluate whether "soup" is a better answer than "coffee"?

Clearly there are a lot of smart people who think that this is important, so I accept my lack of understanding is likely based on my own ignorance. Is it that this task itself is not important, but it's a proxy for ability at other tasks?

What am I missing?


Posted 2019-10-16T14:52:38.063

Reputation: 73



The first line of the BERT abstract is

We introduce a new language representation model called BERT.

The key phrase here is "language representation model". The purpose of BERT and other natural language processing models like Word2Vec is to provide a vector representation of words, so that the vectors can be used as input to neural networks for other tasks.

There are two concepts to grasp here: vector representations of words, and transfer learning. You can find a wealth of information about either of these topics online, but I will give a short summary.

How can BERT be expected to get this right, and how can humans or another algorithm be expected to evaluate whether "soup" is a better answer than "coffee"?

This ambiguity is the strength of word prediction, not a weakness. For language to be fed into a neural network, words somehow have to be converted into numbers. One way would be a simple categorical encoding, where the first word 'a' gets mapped to 1, the second word 'aardvark' gets mapped to 2, and so on. But in this representation, words of similar meaning will not be mapped to similar numbers. As you've said, "soup" and "coffee" have similar meanings compared to all English words (they are both nouns, liquids, types of food/drink normally served hot, and therefore can both be valid predictions for the missing word), so wouldn't it be nice if their numerical representations were also similar to each other?

This is the idea of vector representation. Instead of mapping each word to a single number, each word is mapped to a vector of hundreds of numbers, and words of similar meanings will be mapped to similar vectors.
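As a small illustration of "similar meanings map to similar vectors", here are some hand-made toy vectors (real embeddings, such as those learned by Word2Vec or BERT, have hundreds of dimensions and are learned, not hand-written) compared with cosine similarity:

```python
import math

# Hand-made toy vectors for illustration only. The dimensions here might
# loosely stand for (is_liquid, is_food, is_animal).
embeddings = {
    "soup":   [0.90, 0.80, 0.00],
    "coffee": [0.95, 0.60, 0.00],
    "zebra":  [0.00, 0.10, 0.90],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["soup"], embeddings["coffee"]))  # close to 1
print(cosine(embeddings["soup"], embeddings["zebra"]))   # close to 0
```

In this representation, "soup" and "coffee" are near-neighbours, so a model that predicts one where the other belongs is only slightly wrong, rather than completely wrong.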

The second concept is transfer learning. In many situations, there is only a small amount of data for the task that you want to perform, but there is a large amount of data for a related, but less important task. The idea of transfer learning is to train a neural network on the less important task, and to apply the information that was learned to the other task that you actually care about.
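A minimal sketch of that idea, assuming a hypothetical `pretrained_embedding` table learned on a huge corpus and a deliberately trivial "fine-tuned head" (real transfer learning fine-tunes a neural network, not a sign check):

```python
# Hypothetical vectors that we pretend were learned during pre-training
# on a large corpus; dimension 0 happens to correlate with sentiment.
pretrained_embedding = {
    "great": [0.9, 0.1],
    "awful": [-0.8, 0.2],
    "soup":  [0.1, 0.9],
}

def sentence_vector(sentence):
    # Represent a sentence as the average of its word vectors.
    vecs = [pretrained_embedding[w] for w in sentence.split()
            if w in pretrained_embedding]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def sentiment(sentence):
    # A trivially simple "task head" bolted onto the pre-trained vectors.
    return "positive" if sentence_vector(sentence)[0] > 0 else "negative"

print(sentiment("great soup"))  # positive
print(sentiment("awful soup"))  # negative
```

The point is that the expensive part (learning the vectors) happened once on plentiful data, and the small labelled task only has to learn a thin layer on top.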

As stated in the second half of the BERT abstract,

...the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

To summarize, the answer to your question is that people DON'T care about the masked word prediction task on its own. The strength of this task is that there is a huge amount of data readily and freely available to train on (BERT used the whole of English Wikipedia plus the BookCorpus, with randomly chosen masks), and the task is related to other natural language processing tasks that require interpreting the meaning of words. BERT and other language representation models learn vector embeddings of words, and through transfer learning this information is passed to whatever downstream task you actually care about.
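This is also why the training data is so cheap to produce: any sentence can be turned into a training example just by hiding a word. A simplified sketch (BERT's real recipe masks about 15% of subword tokens, with a few extra tricks):

```python
import random

def make_masked_example(sentence, rng):
    """Hide one randomly chosen word; the hidden word becomes the label."""
    tokens = sentence.split()
    target_index = rng.randrange(len(tokens))
    label = tokens[target_index]
    tokens[target_index] = "[MASK]"
    return " ".join(tokens), label

rng = random.Random(0)  # seeded for reproducibility
masked, label = make_masked_example("FC Barcelona is a football club", rng)
print(masked, "->", label)
```

No human annotation is needed, so the supply of training examples is limited only by the amount of raw text available.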

Brady Gilg

Posted 2019-10-16T14:52:38.063

Reputation: 298

This is really helpful - thanks. So if I understand you correctly, you are saying that the masked word prediction is used to generate the word vectors, so that similar words are close to one another? And then it's those vectors (or embeddings) that we care about? OK that makes sense. Thanks! – SamR – 2019-10-18T09:36:35.730