Should you turn off label smoothing when validating?


As the subject says. On one hand, the answer should be yes: label smoothing is a regularization feature, and how can you know whether it improves performance without turning it off? On the other hand, I haven't seen any authoritative source claiming that it should be turned off during validation; not even the article that introduced the technique mentions it. And afaict, as the feature is implemented in Keras, it is not turned off during validation.

Björn Lindqvist

Posted 2020-06-23T01:41:30.217

Reputation: 141



The way most people gain an initial understanding of label smoothing (and what most common explanations say on the subject) largely determines how one approaches this question.

At first glance, label smoothing is exactly what the name suggests: we modify the labels, or some portion of them, in order to get a better, more general, more robust model. It makes sense that we don't want the model to learn from (and later predict with) extreme confidence levels, especially when we know that some of the labels are wrong, since this hurts the model's ability to perform on unseen data. The intuitive explanation of LS mechanics, then, is that we are not feeding the model pure 1's and 0's but less confident values instead, resulting in a more reserved decision function that doesn't extrapolate in an extreme manner. Now, we know that smoothed labels are not the true labels, so at this point the main concern pops up: once we have trained on smoothed labels, do we also use smoothed labels for validation?
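Concretely, the standard smoothing step replaces each one-hot target y with y(1 - α) + α/K, where K is the number of classes. A minimal numpy sketch (the smoothing factor α = 0.1 and K = 3 classes are arbitrary choices for illustration):

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    """Replace 1/0 targets with (1 - alpha) + alpha/K and alpha/K."""
    k = y_onehot.shape[-1]  # number of classes
    return y_onehot * (1.0 - alpha) + alpha / k

y = np.array([[1.0, 0.0, 0.0]])
print(smooth_labels(y, alpha=0.1))  # ≈ [[0.933, 0.033, 0.033]]
```

Note that each smoothed row still sums to 1, so it remains a valid probability distribution over the classes.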

If you think about LS purely as a data manipulation technique, the answer to the question above is not obvious; it could go either way depending on the argument. However, one must remember that LS is almost always regarded as a regularization technique -- you mention this in the question yourself -- and there is a good reason for that. Regularization, by definition, extends the loss function with an additional term, which usually penalizes something; in LS, this penalty term punishes high-confidence predictions. Even though it may not appear as such, LS, once applied, becomes an essential part of the loss function, and that loss function should persist between training and validation if we aim to take advantage of the technique.

When we apply LS during training, we are effectively minimizing a loss function with the regularization term added in. Throwing it out during validation would defeat the very purpose of including it in the first place: if we decide not to apply LS to the validation set as well, we are making the mistake of expecting extreme confidence (1/0 labels) from a learner that was just specifically trained against making overly confident predictions on unseen data. The correct thing to do is to validate that the predictions are of moderate confidence, just as desired. This is why the validation set should also have the regularizer present in the loss function, i.e. have smoothed labels as well.
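To make this concrete, here is a small framework-agnostic numpy sketch (the class count, α = 0.1, and the two example predictions are arbitrary illustrations): an overconfident prediction looks better than a moderate one when scored against hard 1/0 labels, but worse when scored against the smoothed labels the model was actually trained on -- which is why validating against hard labels misjudges an LS-trained model.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Cross-entropy between a target distribution and predicted probabilities.
    return float(-np.sum(y_true * np.log(y_pred + eps)))

alpha = 0.1
hard = np.array([1.0, 0.0, 0.0])                   # one-hot target
smoothed = hard * (1 - alpha) + alpha / hard.size  # smoothed target

overconfident = np.array([0.999, 0.0005, 0.0005])
moderate = np.array([0.93, 0.035, 0.035])

# Against hard labels, the overconfident prediction gets the lower (better) loss:
print(cross_entropy(hard, overconfident) < cross_entropy(hard, moderate))          # True

# Against smoothed labels, the moderate prediction gets the lower loss instead:
print(cross_entropy(smoothed, moderate) < cross_entropy(smoothed, overconfident))  # True
```

The ordering flips because under smoothed targets the loss-minimizing prediction is the smoothed distribution itself, not the one-hot vector.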


Posted 2020-06-23T01:41:30.217

Reputation: 500

Makes sense but then how would I know if label smoothing improved my model? With LS turned on I'm likely to get different and perhaps even worse values for validation loss and accuracy. I don't want to use LS if I can't say that it makes my model better. – Björn Lindqvist – 2020-06-23T20:55:57.433

I would suggest comparing two models (one with LS, another without it) by testing their accuracy/precision/recall/AUC/whatever on a holdout set. That way you'd be looking at an objective performance metric. – Vlad_Z – 2020-06-23T23:10:44.363

Isn't that more or less what is done with the validation dataset? – Björn Lindqvist – 2020-06-24T05:26:18.390

Not quite. In the beginning, we use training and validation sets to first teach the model and then make sure that it learned in the right way. We usually build multiple models over training+validation data as we try different techniques, tune parameters, etc. The test dataset (also called holdout) is used only at the very end to assess unbiased performance of the final model(s) -- this is the data that never affected the models being tested nor (and this is equally important) our previous decisions which led to having these models as final candidates. – Vlad_Z – 2020-06-24T06:32:02.553


AFAIK, label smoothing comes into the picture when calculating the loss during training. There is no loss computation during validation.


Posted 2020-06-23T01:41:30.217

Reputation: 1

Sure there is. You're thinking about testing perhaps. – Björn Lindqvist – 2020-06-23T02:19:51.747


Label smoothing is a regularization technique applied to the target values so that the model can learn the data without overfitting. There is no need to apply label smoothing during validation, but even if you do, it won't be a problem.
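One reason it "won't be a problem" for accuracy-style metrics: smoothing leaves the argmax of each target unchanged (the true class keeps weight 1 - α + α/K versus α/K for the rest), so only the loss value shifts. A quick numpy check (α = 0.1 and K = 5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
k = 5

# Random one-hot labels and their smoothed versions.
labels = np.eye(k)[rng.integers(0, k, size=100)]
smoothed = labels * (1 - alpha) + alpha / k

# The argmax -- and hence any accuracy computation -- is identical.
print(np.array_equal(labels.argmax(axis=1), smoothed.argmax(axis=1)))  # True
```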


Posted 2020-06-23T01:41:30.217

Reputation: 487