## How label smoothing and label flipping increase the performance of a machine learning model

I have seen posts and research papers mentioning these techniques for improving the performance of a machine learning model.

These techniques certainly make some sense when we are not sure how correct the labelling of our dataset is.

However, I am wondering whether these two techniques are proven to be beneficial for an ML model when the labelling is correct.

Label flipping is a training technique where one selectively manipulates the labels in order to make the model more robust against label noise and associated attacks - the specifics depend a lot on the nature of the noise. Label flipping bears no benefit only under the assumption that all labels are (and will always be) correct and that no adversaries exist. In cases where noise tolerance is desirable, training with label flipping is beneficial.
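A minimal sketch of the idea (not part of the original answer; the flip probability `flip_prob` and the uniform choice of replacement class are illustrative assumptions, since the right scheme depends on the noise model):

```python
import random

def flip_labels(labels, num_classes, flip_prob=0.1, seed=0):
    """Randomly reassign a fraction of labels to a different class.

    Training on deliberately corrupted labels can make a model more
    tolerant of label noise; flip_prob and the uniform flip
    distribution here are illustrative choices, not prescribed by
    any particular paper.
    """
    rng = random.Random(seed)
    flipped = []
    for y in labels:
        if rng.random() < flip_prob:
            # Pick any class other than the current one.
            choices = [c for c in range(num_classes) if c != y]
            flipped.append(rng.choice(choices))
        else:
            flipped.append(y)
    return flipped
```

In practice the flipping would be applied per epoch (or per batch) rather than once, so the model sees different corruptions over time.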

Label smoothing is a regularization technique (and then some) aimed at improving model performance. Its effect takes place irrespective of label correctness.

Without label smoothing, a softmax classifier is trained to make infinitely confident predictions on the training set. This encourages the model to learn large weights and strong responses. When inputs fall outside the regions where the training data concentrates, those large weights cause the model to extrapolate to even more extreme predictions. Label smoothing penalizes the model for making overly confident predictions on the training set, forcing it to learn either a more non-linear function or a linear function with smaller slope. Extrapolations by the label-smoothed model are consequently less extreme.
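Concretely, label smoothing just changes the training targets. A minimal sketch (the smoothing factor `eps=0.1` is a typical hyperparameter choice, not something fixed by the answer):

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Turn integer labels into smoothed target distributions.

    A one-hot target [0, 1, 0] with eps=0.1 and 3 classes becomes
    roughly [0.033, 0.933, 0.033]: the model is never asked to
    assign probability exactly 1 to any class, so it cannot be
    rewarded for driving its logits toward infinity.
    """
    onehot = np.eye(num_classes)[y]
    return (1.0 - eps) * onehot + eps / num_classes
```

The smoothed targets are then used in the usual cross-entropy loss in place of the one-hot targets.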

Confident predictions correspond to output distributions that have low entropy. A network is over-confident when it places all probability on a single class in the training set, which is often a symptom of overfitting. The confidence penalty constitutes a regularization term that prevents these peaked distributions, leading to better generalization.
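The confidence penalty mentioned above can be sketched as a loss term proportional to the negative entropy of the model's output distribution (a rough illustration; the weight `beta` is a hypothetical hyperparameter):

```python
import numpy as np

def confidence_penalty(probs, beta=0.1):
    """Regularization term that penalizes low-entropy (peaked) outputs.

    Returns beta * (negative entropy) per example; adding this to the
    loss discourages the network from placing all probability mass on
    a single class. beta is an illustrative choice.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    return -beta * entropy  # lower entropy -> larger (less negative) penalty
```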

As a result of label smoothing, the model becomes more robust in general. Its increased ability to deal with incorrect labels is just part of the overall improvement. However, one cannot claim that the effects of label smoothing are purely beneficial.

Despite having a positive effect on generalization and calibration, label smoothing can hurt distillation. This effect can be explained in terms of erasure of information. With label smoothing, the model is encouraged to treat each incorrect class as equally probable. With hard targets, less structure is enforced in later representations, enabling more logit variation across predicted classes and/or across examples. This can be quantified by estimating the mutual information between input example and output logit, and label smoothing has been shown to reduce this mutual information.

Suppose you have a language model trained to predict the next word. One sample in your training data is

hello, how, are, you


so that the input is the three words "hello, how, are" and the output the word "you". Without label smoothing, you would be telling the network $$P(\mathrm{you}|\textrm{hello},\textrm{how},\textrm{are}) = 1.0$$ That is, "you" will always follow the three words "hello, how, are".

That is wrong. There are hundreds of words that could follow "hello, how, are" (e.g. "hello, how, are, they").

In this case, smoothing the labels means that the network is trained on more realistic targets.
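Numerically, for a toy vocabulary of five candidate next words (the vocabulary and smoothing factor below are made up for illustration), smoothing turns the hard target into:

```python
import numpy as np

# Hypothetical 5-word vocabulary; index 3 is "you".
vocab = ["they", "things", "we", "you", "doing"]
target = 3
eps = 0.1

onehot = np.eye(len(vocab))[target]              # [0, 0, 0, 1, 0]
smoothed = (1 - eps) * onehot + eps / len(vocab)
# "you" now gets probability 0.92 and every other word 0.02, so the
# network is no longer told that P(you | hello, how, are) is exactly 1.
```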