## How to train a model on top of a transformer to output a sequence?


I am using Hugging Face to build a model that is capable of identifying mistakes in a given sentence. Say I have a sentence and a corresponding label as follows:

correct_sentence = "we used to play together."
correct_label = [1, 1, 1, 1, 1]

changed_sentence = "we use play to together."
changed_label = [1, 2, 2, 2, 1]


These labels are further padded with 0s to an equal length of 512. The sentences are also tokenized and padded (or truncated) to this length. The model is as follows:

class Camembert(torch.nn.Module):
    """
    The definition of the custom model: the last 15 layers of Camembert will be
    retrained, followed by a fcn to 512 (the size of every label).
    """
    def __init__(self, cam_model):
        super(Camembert, self).__init__()
        self.l1 = cam_model
        total_layers = 199
        for i, param in enumerate(cam_model.parameters()):
            if total_layers - i > hparams["retrain_layers"]:
                param.requires_grad = False  # freeze the earlier layers
            else:
                pass  # leave the last hparams["retrain_layers"] layers trainable
        self.l2 = torch.nn.Dropout(hparams["dropout_rate"])
        self.l3 = torch.nn.Linear(768, 512)

    def forward(self, ids, mask):
        output = self.l1(ids, attention_mask=mask)[1]  # pooled output, shape (batch_size, 768)
        output = self.l2(output)
        output = self.l3(output)
        return output


Say batch_size=2; the output will therefore be (2, 512), which is the same shape as the target label. To the best of my knowledge, this setup is like saying there are 512 classes to be classified, which is not what I want. The problem arises when I try to calculate the loss using torch.nn.CrossEntropyLoss(), which gives me the following error (truncated):

  File "D:\Anaconda\lib\site-packages\torch\nn\functional.py", line 1838, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: multi-target not supported at C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src\THCUNN/generic/ClassNLLCriterion.cu:15
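The error above can be reproduced in isolation: in class-index mode, torch.nn.CrossEntropyLoss expects scores of shape (N, C) and a target of shape (N,) holding one class index per example, so a 2-D integer target like the padded label tensor is rejected. A minimal sketch:

```python
import torch

# Class-index mode of CrossEntropyLoss expects:
#   input  (N, C) -- raw scores, one row per example
#   target (N,)   -- one class index per example
criterion = torch.nn.CrossEntropyLoss()

scores = torch.randn(2, 512)                # (batch_size, C)
target = torch.randint(0, 512, (2,))        # shape (2,): accepted
loss = criterion(scores, target)            # scalar loss, no error

bad_target = torch.randint(0, 3, (2, 512))  # per-token labels, shape (2, 512)
try:
    criterion(scores, bad_target)           # raises the multi-target RuntimeError
    raised = False
except RuntimeError:
    raised = True
```

This is why handing CrossEntropyLoss a whole (batch_size, 512) label tensor fails: it reads the second dimension as extra target dimensions, not as per-token labels.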


How am I supposed to solve this issue? Are there any tutorials for similar kinds of models?


I think you should treat this problem as a binary classification problem. For each word in the changed sentence, you will have a binary label: correct or incorrect. I would recommend relabeling so that "correct" words will have a label of 0 and "incorrect" words will have a label of 1. In your example you would have:

correct_sentence = "we used to play together"
changed_sentence = "we use play to together"
labels = [0, 1, 1, 1, 0]


And instead of padding with some special value, pad with the "correct" label (which would be 0 if you use my suggestion above).
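The relabeling and padding described above can be sketched in plain Python. The positional token comparison here is a simplification that happens to fit the example; real data may need a proper alignment:

```python
MAX_LEN = 512

def make_binary_labels(correct_tokens, changed_tokens, max_len=MAX_LEN):
    """0 = token matches the correct sentence, 1 = token was changed."""
    labels = [0 if a == b else 1 for a, b in zip(correct_tokens, changed_tokens)]
    # Pad with the "correct" label so padded positions never count as mistakes.
    return labels + [0] * (max_len - len(labels))

correct = "we used to play together".split()
changed = "we use play to together".split()
labels = make_binary_labels(correct, changed)
# labels[:5] == [0, 1, 1, 1, 0]; the remaining 507 entries are 0
```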

Class labels conventionally start at index 0, so this labeling scheme matches what PyTorch expects for binary classification targets.

Next, you will need to add an activation function after your final Linear layer. Right now, your model ends with a bare Linear layer, so its outputs are unbounded real numbers. That doesn't make sense for a classification problem, where the targets are class labels in {0, ..., C-1} (here just 0 and 1).

Instead, you should apply an activation function to make your outputs behave more like class labels. For a binary classification problem, a good choice for the final activation is torch.nn.Sigmoid. You would modify your model definition like this:

class Camembert(torch.nn.Module):
    """
    The definition of the custom model: the last 15 layers of Camembert will be
    retrained, followed by a fcn to 512 (the size of every label).
    """
    def __init__(self, cam_model):
        super(Camembert, self).__init__()
        self.l1 = cam_model
        total_layers = 199
        for i, param in enumerate(cam_model.parameters()):
            if total_layers - i > hparams["retrain_layers"]:
                param.requires_grad = False  # freeze the earlier layers
            else:
                pass  # leave the last hparams["retrain_layers"] layers trainable
        self.l2 = torch.nn.Dropout(hparams["dropout_rate"])
        self.l3 = torch.nn.Linear(768, 512)
        self.activation = torch.nn.Sigmoid()

    def forward(self, ids, mask):
        output = self.l1(ids, attention_mask=mask)[1]  # pooled output, shape (batch_size, 768)
        output = self.l2(output)
        output = self.l3(output)
        output = self.activation(output)
        return output


Your output will now have dimension (batch_size, 512). Each of the 512 outputs will be a number between 0 and 1, which you can treat as the probability of that particular token being "incorrect". If the output is greater than 0.5, predict "incorrect"; otherwise, predict "correct".
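Turning the sigmoid probabilities into predicted labels is then a one-line threshold (toy values below, standing in for real model outputs):

```python
import torch

# One probability per token, as produced by the sigmoid layer
probs = torch.tensor([[0.10, 0.92, 0.81, 0.73, 0.24]])
preds = (probs > 0.5).long()   # 1 = "incorrect" token, 0 = "correct"
# preds.tolist() == [[0, 1, 1, 1, 0]]
```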

Finally, since you're treating the problem as a binary classification problem, you will want to use binary cross-entropy loss (torch.nn.BCELoss). Note that BCELoss expects the labels to be floating-point and to have the same shape as the output, so convert and batch them accordingly.

model = Camembert(cam_model)
criterion = torch.nn.BCELoss()

input = <tokenized, padded input sequence>
labels = torch.tensor([[0., 1., 1., 1., 0., . . .  , 0.]])  # float, shape (batch_size, 512)
output = model(input)                                       # shape (batch_size, 512)
loss = criterion(output, labels)
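To sanity-check the shapes without the real model and tokenizer, a dummy stand-in for the network output works (the sizes below are just the ones assumed throughout this question):

```python
import torch

batch_size, seq_len = 2, 512
# Stand-in for the model's sigmoid output: probabilities in (0, 1)
probs = torch.sigmoid(torch.randn(batch_size, seq_len))

labels = torch.zeros(batch_size, seq_len)  # all tokens "correct"...
labels[:, 1:4] = 1.0                       # ...except tokens 1-3, marked "incorrect"

criterion = torch.nn.BCELoss()
loss = criterion(probs, labels)            # scalar loss, no shape error
```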