Feeding XLM-R embeddings to neural machine translation?


I’m very new to the field of deep learning. My aim is to translate from Catalan to Catalan Sign Language. The grammars of the two languages differ:

Input: He sells food. Output (sign language sentence): Food he sells.

I've been playing around with XLM-R and got the token ids like this:

input Ids: [200, 100, 2003, 1037, 3835, 3351, 5012, 300, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I don't know how to use the embeddings in a sequence-to-sequence NMT model, or any other way to do machine translation with a very small data set. The target language is a low-resource language.

```
import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
model = XLMRobertaModel.from_pretrained('xlm-roberta-large')

def get_ids(tokens, tokenizer, max_seq_length):
    # Map tokens to vocabulary ids and pad to a fixed length
    # with the tokenizer's pad token id (1 for XLM-R, not 0)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [tokenizer.pad_token_id] * (max_seq_length - len(token_ids))
    return input_ids

s = "test sentence"
stokens = tokenizer.tokenize(s)
print(stokens)
# XLM-R uses <s> ... </s> as its special tokens, not BERT's [CLS] ... [SEP]
stokens = [tokenizer.cls_token] + stokens + [tokenizer.sep_token]
input_ids = get_ids(stokens, tokenizer, 15)

print(tokenizer.convert_tokens_to_ids(['test']))
print(tokenizer.convert_tokens_to_ids(['▁test']))
print(tokenizer.convert_ids_to_tokens([26130]))
print(tokenizer.convert_ids_to_tokens([30521]))
tokens_tensor = torch.tensor([input_ids])
print(input_ids)
print(tokens_tensor)
```

NLP Dude

Posted 2020-03-18T08:19:33.093

Reputation: 31

Answers


You only got the indices of the tokens in XLM-R's vocabulary. These indices are the input to XLM-R; you still need to actually run the model. By calling

model(tokens_tensor)

you get a tuple of tensors containing the model's outputs. Check the documentation for what the outputs are.

Jindřich

Posted 2020-03-18T08:19:33.093

Reputation: 888