How to train a model on top of a transformer to output a sequence?


I am using huggingface to build a model that is capable of identifying mistakes in a given sentence. Say I have a sentence and a corresponding label as follows:

correct_sentence = "we used to play together."
correct_label = [1, 1, 1, 1, 1]

changed_sentence = "we use play to together."
changed_label = [1, 2, 2, 2, 1]

These labels are further padded with 0s to a length of 512. The sentences are also tokenized and padded (or truncated) to this same length. The model is as follows:

class Camembert(torch.nn.Module):
    """
    The definition of the custom model: the last 15 layers of Camembert will be
    retrained, followed by a fully connected layer to 512 (the size of every label).
    """
    def __init__(self, cam_model):
        super(Camembert, self).__init__()
        self.l1 = cam_model
        total_layers = 199
        # Freeze every parameter tensor except the last `retrain_layers`
        for i, param in enumerate(cam_model.parameters()):
            if total_layers - i > hparams["retrain_layers"]:
                param.requires_grad = False
        self.l2 = torch.nn.Dropout(hparams["dropout_rate"])
        self.l3 = torch.nn.Linear(768, 512)

    def forward(self, ids, mask):
        _, output = self.l1(ids, attention_mask=mask)
        output = self.l2(output)
        output = self.l3(output)
        return output

Say batch_size=2; the output will therefore have shape (2, 512), the same as the target labels. To the best of my knowledge, this setup amounts to saying there are 512 classes to choose between, which is not what I want. The problem arises when I try to calculate the loss using torch.nn.CrossEntropyLoss(), which gives me the following error (truncated):

 File "D:\Anaconda\lib\site-packages\torch\nn\functional.py", line 1838, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: multi-target not supported at C:/w/1/s/tmp_conda_3.7_100118/conda/conda-bld/pytorch_1579082551706/work/aten/src\THCUNN/generic/ClassNLLCriterion.cu:15
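For context, the error comes from a shape mismatch: given an input of shape (N, C), torch.nn.CrossEntropyLoss expects a target of shape (N,) holding one class index per sample, so a (2, 512) integer target is rejected. A minimal sketch reproducing this (illustrative shapes taken from above):

```python
import torch

criterion = torch.nn.CrossEntropyLoss()
logits = torch.randn(2, 512)                     # model output: (batch_size, 512)
target = torch.zeros(2, 512, dtype=torch.long)   # per-token labels: (batch_size, 512)

# CrossEntropyLoss reads (2, 512) logits as 2 samples over 512 classes and
# wants a target of shape (2,); the 2-D integer target raises a RuntimeError.
try:
    criterion(logits, target)
except RuntimeError as e:
    print(type(e).__name__)  # RuntimeError
```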

How am I supposed to solve this issue? Are there any tutorials for similar kinds of models?

IE Irodov

Posted 2020-10-30T11:37:22.500

Reputation: 173

Answers


I think you should treat this problem as a binary classification problem. For each word in the changed sentence, you will have a binary label: correct or incorrect. I would recommend relabeling so that "correct" words will have a label of 0 and "incorrect" words will have a label of 1. In your example you would have:

correct_sentence = "we used to play together"
changed_sentence = "we use play to together"
labels = [0, 1, 1, 1, 0]

And instead of padding with some special value, pad with the "correct" label (which would be 0 if you use my suggestion above).

Conventionally, class labels always start at index 0, so this labeling scheme will match what PyTorch expects for binary classification problems.
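As a sketch, converting the question's labeling scheme (1 = correct, 2 = incorrect) to this binary scheme and padding to the 512 length from the question could look like:

```python
def to_binary_labels(old_labels, max_len=512):
    """Map 1 -> 0 (correct) and 2 -> 1 (incorrect), then pad with the "correct" label 0."""
    binary = [0 if lab == 1 else 1 for lab in old_labels]
    return binary + [0] * (max_len - len(binary))

changed_label = [1, 2, 2, 2, 1]
labels = to_binary_labels(changed_label)
print(labels[:6])   # [0, 1, 1, 1, 0, 0]
print(len(labels))  # 512
```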


Next, you will need to change the activation for your final layer. Right now, your model ends with just a Linear layer, meaning the output is unbounded. This doesn't really make sense for a classification problem, because each output should be interpretable as a class probability, not an arbitrary real number.

Instead, you should apply an activation function to make your outputs behave more like class labels. For a binary classification problem, a good choice for the final activation is torch.nn.Sigmoid. You would modify your model definition like this:

class Camembert(torch.nn.Module):
    """
    The definition of the custom model: the last 15 layers of Camembert will be
    retrained, followed by a fully connected layer to 512 (the size of every label).
    """
    def __init__(self, cam_model):
        super(Camembert, self).__init__()
        self.l1 = cam_model
        total_layers = 199
        # Freeze every parameter tensor except the last `retrain_layers`
        for i, param in enumerate(cam_model.parameters()):
            if total_layers - i > hparams["retrain_layers"]:
                param.requires_grad = False
        self.l2 = torch.nn.Dropout(hparams["dropout_rate"])
        self.l3 = torch.nn.Linear(768, 512)
        self.activation = torch.nn.Sigmoid()

    def forward(self, ids, mask):
        _, output = self.l1(ids, attention_mask=mask)
        output = self.l2(output)
        output = self.l3(output)
        output = self.activation(output)
        return output

Your output will now have shape (batch_size, 512), and each of the 512 outputs will be a number between 0 and 1. You can treat this as the probability of each particular token being "incorrect": if the output is greater than 0.5, the predicted label is "incorrect"; otherwise, it is "correct".
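As a sketch, turning the sigmoid outputs into hard labels is a simple threshold (illustrative values, a shorter sequence for readability):

```python
import torch

# Stand-in for the sigmoid output of one batch: (batch_size, seq_len)
probs = torch.tensor([[0.1, 0.9, 0.8, 0.7, 0.2, 0.05]])

# Threshold at 0.5: 1 = "incorrect" token, 0 = "correct" token
preds = (probs > 0.5).long()
print(preds.tolist())  # [[0, 1, 1, 1, 0, 0]]
```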

Finally, since you're treating the problem as a binary classification problem, you will want to use binary cross-entropy loss (torch.nn.BCELoss). Note that BCELoss expects the labels to be floats with the same shape as the model output, so cast them with labels.float().

model = Camembert(cam_model)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

ids, mask = <tokenized, padded input sequence and its attention mask>
labels = torch.tensor([[0, 1, 1, 1, 0, . . .  , 0]])  # shape: (batch_size, 512)
output = model(ids, mask)
loss = criterion(output, labels.float())

optimizer.zero_grad()
loss.backward()
optimizer.step()
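As a quick shape sanity check (stand-in tensors only; the 512 pad length comes from the question), BCELoss with matching (batch_size, 512) float tensors reduces to a scalar loss:

```python
import torch

batch_size, seq_len = 2, 512

# Stand-ins for the model's sigmoid outputs and the padded binary labels
output = torch.rand(batch_size, seq_len)   # values in [0, 1)
labels = torch.zeros(batch_size, seq_len)  # float labels, same shape as output

criterion = torch.nn.BCELoss()
loss = criterion(output, labels)
print(loss.shape)  # torch.Size([]) -- a scalar
```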

zachdj

Posted 2020-10-30T11:37:22.500

Reputation: 1 816

This works well! I'll wait 24 hours before accepting this, but this solved the issue at hand. Are there any similar cases where one cannot reframe the problem into a classification? – IE Irodov – 2020-10-31T02:46:35.383

There might be, depending on how much you stretch the meaning of "similar". One similar case would be not only identifying which tokens are erroneous, but also identifying which type of error they are. In that case, you might have many classes: correct, misspelling, word transposition, grammatical error, etc. It would still be a classification problem, but it would be multi-class rather than binary.

I can't think of a similar problem where classification/prediction is not involved. Generally, the alternative to classification is regression, and regression problems are not common in NLP. – zachdj – 2020-11-03T02:48:13.267

I guess another similar problem would be error correction. In this case, the model would seek to replace each incorrect token with another token from the vocabulary. This is technically still a classification/prediction problem, but the number of classes = size of vocabulary. – zachdj – 2020-11-03T02:49:15.433
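The multi-class variant mentioned in the comments (not part of the original answer) would replace the sigmoid with per-token logits over C error types and use torch.nn.CrossEntropyLoss, which for sequences expects logits of shape (batch, C, seq_len) and integer targets of shape (batch, seq_len). A sketch with stand-in tensors:

```python
import torch

batch_size, seq_len, num_classes = 2, 512, 4  # e.g. correct, misspelling, transposition, grammar

# Stand-in per-token logits; a real model would produce these from the encoder
logits = torch.randn(batch_size, seq_len, num_classes)
targets = torch.zeros(batch_size, seq_len, dtype=torch.long)

# CrossEntropyLoss wants the class dimension second, so permute to (batch, C, seq_len)
loss = torch.nn.CrossEntropyLoss()(logits.permute(0, 2, 1), targets)
print(loss.shape)  # torch.Size([]) -- a scalar
```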