Can I fine-tune BERT, ELMo or XLNet for Seq2Seq neural machine translation?



I'm working on a neural machine translator that translates English sentences to American Sign Language sentences (e.g. below). I have a quite small dataset - around 1000 sentence pairs. I'm wondering if it is possible to fine-tune BERT, ELMo or XLNet for seq2seq encoder/decoder machine translation.

English: He sells food.

American sign language: Food he sells

NLP Dude

Posted 2020-02-24T08:40:38.953

Reputation: 31

I don't know American Sign Language, but is it only about word reordering? – Astariul – 2020-02-25T03:33:20.453

Not only reordering. The grammar is a bit different from English. English relies on Subject-Verb-Object (SVO) sentence structure, while ASL more frequently uses Topic-Comment structure. – NLP Dude – 2020-02-25T06:48:48.980

Then it seems to be very similar to a translation task. BERT is only an encoder, so it cannot be used alone for seq2seq tasks, but it's definitely possible to add a decoder and use BERT as the encoder. Or simply use a seq2seq architecture such as BART. – Astariul – 2020-02-25T07:18:29.400

Thank you so much, it's very helpful. One last question: is it possible to do something similar and use BERT as an encoder?

– NLP Dude – 2020-02-25T11:20:42.633



You can view models like ELMo or BERT as encoder-only. They can be easily used for classification or sequence tagging, but the tag sequence is typically monotonically aligned with the source sequence. Even though the Transformer layers in BERT or XLNet are in theory capable of arbitrary reordering (which is used in non-autoregressive machine translation models), this is not what BERT or XLNet were trained for, and therefore it will be hard to fine-tune them for that.

If at least the vocabulary is the same on both the source and target side, I would recommend pre-trained sequence-to-sequence models such as MASS or BART.
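A fine-tuning loop for a BART-style model might look like the sketch below. To keep it self-contained, it builds a tiny randomly initialised BART from a config; in practice you would instead load the pre-trained weights with `BartForConditionalGeneration.from_pretrained("facebook/bart-base")` and tokenise your sentence pairs with the matching tokenizer (the random token ids here are just stand-ins for your data).

```python
# Sketch: fine-tuning a BART-style seq2seq model on English -> ASL pairs.
# A tiny randomly initialised BART stands in for the real pre-trained one.
import torch
from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    vocab_size=1000, d_model=64,
    encoder_layers=2, decoder_layers=2,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=128, decoder_ffn_dim=128,
    max_position_embeddings=64,
)
model = BartForConditionalGeneration(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy token ids standing in for tokenised pairs such as
# "He sells food." -> "Food he sells".
src = torch.randint(4, 1000, (2, 10))   # batch of 2 source sentences
tgt = torch.randint(4, 1000, (2, 8))    # batch of 2 target sentences

out = model(input_ids=src, labels=tgt)  # passing labels computes the LM loss
out.loss.backward()
optimizer.step()
```

With only ~1000 pairs you would loop over the whole dataset for a few epochs; starting from pre-trained weights is what makes such a small dataset workable at all.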

If both the grammar and vocabulary of the sign language are quite different, then using BERT as an encoder and training your own lightweight autoregressive decoder might be the correct way.
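That setup can be sketched as follows. A small randomly initialised Transformer encoder stands in for BERT here so the example runs without downloads; in practice you would plug in `transformers.BertModel.from_pretrained("bert-base-uncased")` as the encoder, freeze (or partially freeze) it, and train only the small decoder on your sentence pairs.

```python
# Sketch: frozen pre-trained encoder + lightweight autoregressive decoder.
import torch
import torch.nn as nn

VOCAB, D_MODEL = 1000, 64

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for a pre-trained BERT encoder (outputs hidden states).
        self.src_embed = nn.Embedding(VOCAB, D_MODEL)
        enc_layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Lightweight decoder, trained from scratch on the ~1000 pairs.
        self.tgt_embed = nn.Embedding(VOCAB, D_MODEL)
        dec_layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src, tgt):
        memory = self.encoder(self.src_embed(src))
        # Causal mask: each target position attends only to earlier ones.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.decoder(self.tgt_embed(tgt), memory, tgt_mask=mask)
        return self.out(dec)

model = Seq2Seq()
# Freeze the "pre-trained" encoder side; only the decoder gets gradients.
for p in list(model.src_embed.parameters()) + list(model.encoder.parameters()):
    p.requires_grad = False

src = torch.randint(0, VOCAB, (2, 10))  # e.g. "He sells food."
tgt = torch.randint(0, VOCAB, (2, 8))   # e.g. "Food he sells"
logits = model(src, tgt)                # (batch, tgt_len, vocab)
```

Freezing the encoder keeps the number of trainable parameters small, which matters when you only have about a thousand training pairs.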


Posted 2020-02-24T08:40:38.953

Reputation: 888

Thank you so much, it's very helpful. One last question: is it possible to do something similar and use BERT as an encoder? I'm thinking of pre-training BERT from scratch and doing this for a low-resource language rather than English to ASL.

– NLP Dude – 2020-02-25T11:20:49.173

Yes, it should be possible. However, I would consider using a Transformer decoder instead of a GRU, but both should work. – Jindřich – 2020-02-25T13:27:50.940

Thank you so much, I really appreciate it. – NLP Dude – 2020-02-25T13:41:10.437

I was researching cross-lingual approaches and found that the Facebook AI research team released XLM-R, which is trained in one language and used with other languages. I would like to ask you if it's possible to fine-tune XLM-R for the task we discussed. – NLP Dude – 2020-02-28T09:14:53.997

Couldn't you use BERT's next sentence prediction task as a basis on which to fine-tune your seq2seq downstream task? – npit – 2020-09-21T02:17:13.150