Can BERT do the next-word-predict task?

15

2

As BERT is bidirectional (it uses a bidirectional Transformer encoder), is it possible to use it for the next-word prediction task? If yes, what needs to be tweaked?

不是phd的phd

Posted 2019-02-28T08:37:42.190

Reputation: 491

Have you seen the original publication? It seems to be addressing prediction at the sentence level, as explained in its section 3.3.2.

– mapto – 2019-02-28T09:07:55.377

Consider a related discussion on GitHub.

– mapto – 2019-03-13T10:04:36.463

Answers

18

BERT can't be used for next word prediction, at least not with the current state of the research on masked language modeling.

BERT is trained on a masked language modeling task and therefore you cannot "predict the next word". You can only mask a word and ask BERT to predict it given the rest of the sentence (both to the left and to the right of the masked word).
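This masked-word usage can be sketched with the Hugging Face transformers library (an assumption here; the original answer does not name a library, and `bert-base-uncased` is just an example checkpoint):

```python
# Minimal sketch of masked-word prediction with BERT, assuming the
# Hugging Face `transformers` package and a backend such as PyTorch
# are installed. The model name is an example checkpoint.
from transformers import pipeline

# "fill-mask" loads BERT with its masked-language-model head
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT conditions on context BOTH to the left and right of [MASK]
results = unmasker("Paris is the [MASK] of France.")
for r in results[:3]:
    print(r["token_str"], round(r["score"], 3))
```

Note that the `[MASK]` token can sit anywhere in the sentence; there is no notion of "next" word in this objective.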

Thus, with BERT you can't sample text as if it were a normal autoregressive language model. However, BERT can be viewed as a Markov Random Field language model and used for text generation as such. See the article BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model for details. The authors released their source code and a Google Colab notebook.

Update: the authors of the MRF article discovered that their analysis was flawed and that BERT is not an MRF; see this.

Update 2: despite not being meant for next word prediction, there have been attempts to use BERT that way. Here you can find a project that does next word prediction with BERT, XLNet, RoBERTa, etc.

noe

Posted 2019-02-28T08:37:42.190

Reputation: 10 494

Why can't you just control the mask to be the last word in the sequence? Then use BERT to predict the masked token (the next word). I'm still digesting these results, so I can't say how to implement it. Still, it seems like a plausible approach. – Sledge – 2019-03-05T21:40:26.450

Someone tried this in several Twitter discussions about BERT after it was released, and he confirmed that BERT failed with the approach @Sledge is describing. If you cut a sentence and ask BERT to predict the next word, it will not be able to use the right-hand part of the sentence, which it needs to perform the prediction. – noe – 2019-03-06T13:55:40.370
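The "mask the last position" idea can be tried directly (a sketch assuming the Hugging Face transformers fill-mask pipeline; the sentence is an arbitrary example). Because BERT was trained with right-hand context available, predictions at a final mask tend to be generic fillers rather than useful next-word guesses:

```python
# Sketch of masking the LAST position to fake next-word prediction,
# assuming the Hugging Face `transformers` fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Truncate a sentence and place [MASK] at the end. BERT has almost no
# right-hand context here, which is what it relies on during training.
predictions = unmasker("The cat sat on the [MASK].")
print([p["token_str"] for p in predictions])
```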

I see, @ncasas thanks for the explanation. – Sledge – 2019-03-06T15:12:29.820

@ncasas May I ask if this statement (that BERT is not suitable for next word prediction) is also true for the Transformer architecture in general? I got the impression that GPT uses next-token prediction, but I'm not completely sure. – viceriel – 2020-12-21T14:09:21.387

No, the Transformer architecture was originally meant for text generation. GPT-2 is a standard Transformer decoder trained with a causal language modeling loss, and therefore it is suitable for text generation. BERT, on the other hand, is trained with a masked language modeling loss, which is why it is not meant for text generation.

– noe – 2020-12-21T14:37:28.220
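The contrast with a causal model can be sketched as follows (assuming the Hugging Face transformers text-generation pipeline; `gpt2` and the prompt are example choices). Next-word prediction is exactly GPT-2's training objective, so no tweaking is needed:

```python
# Sketch of genuine next-word prediction with a causal LM (GPT-2),
# assuming the Hugging Face `transformers` package is installed.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Greedy decoding (do_sample=False) returns the most likely
# left-to-right continuation of the prompt.
out = generator("The capital of France is",
                max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```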

@Sledge I updated the answer with a github repo that attempts to do next word prediction with BERT. – noe – 2021-02-19T11:31:12.760