Early stopping on validation loss or on accuracy?


I am currently training a neural network and I cannot decide which to use to implement my early-stopping criterion: validation loss or a metric like accuracy/F1-score/AUC/whatever calculated on the validation set.

In my research, I came upon articles defending both standpoints. Keras seems to default to the validation loss but I have also come across convincing answers for the opposite approach (e.g. here).

Does anyone have guidance on when to prefer the validation loss and when to prefer a specific metric?

qmeeus

Posted 2018-08-20T12:22:25.053

Reputation: 726

Answers


TL;DR: Monitor the loss rather than the accuracy

I will answer my own question, since I think the answers I received missed the point and someone else might have the same problem one day.

First, let me quickly clarify that using early stopping is perfectly normal when training neural networks (see the relevant sections in Goodfellow et al.'s Deep Learning book, most DL papers, and the documentation for Keras' EarlyStopping callback).

Now, regarding the quantity to monitor: prefer the loss to the accuracy. Why? The loss quantifies how certain the model is about a prediction (essentially, having a value close to 1 for the right class and close to 0 for the other classes). The accuracy merely accounts for the number of correct predictions. Similarly, any metric that uses hard predictions rather than probabilities has the same problem.
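To make this concrete, here is a small illustration (toy numbers of my own, not from the original post): two sets of predictions with identical accuracy but very different cross-entropy loss, because one model is far more confident than the other.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0])                  # binary labels
p_confident = np.array([0.95, 0.90, 0.10, 0.05]) # confident and correct
p_hesitant  = np.array([0.55, 0.60, 0.45, 0.40]) # barely correct

def accuracy(y, p):
    # Hard predictions: threshold at 0.5, count the correct ones
    return np.mean((p > 0.5) == y)

def log_loss(y, p):
    # Binary cross-entropy: sensitive to how confident the predictions are
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(accuracy(y_true, p_confident), accuracy(y_true, p_hesitant))  # 1.0 and 1.0
print(log_loss(y_true, p_confident), log_loss(y_true, p_hesitant))  # ~0.08 vs ~0.55
```

Accuracy cannot distinguish the two models, while the loss keeps improving as the model grows more certain; that is why the loss is the more informative quantity to monitor.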

Obviously, whatever metric you end up choosing, it has to be calculated on a validation set and not on the training set (otherwise, you are completely missing the point of using EarlyStopping in the first place).
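For reference, a minimal Keras sketch along these lines (the patience value is illustrative, not a recommendation):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for `patience` epochs
# and roll back to the weights from the best epoch.
early_stopping = EarlyStopping(
    monitor="val_loss",          # monitor the loss, not val_accuracy
    patience=10,                 # illustrative value; tune for your problem
    restore_best_weights=True,
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stopping])
```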

qmeeus

Posted 2018-08-20T12:22:25.053

Reputation: 726

If the values are between 0 and 1, cross-entropy loss is a better candidate than MSE or MAE. Check out the Wrap-Up section of this article, and this post on stats.

– Esmailian – 2019-04-19T16:16:43.357

@Esmailian it is not a matter of preference; for classification problems, MSE & MAE are simply not appropriate. – desertnaut – 2019-09-02T17:22:25.370
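To illustrate the point made in this comment thread (a toy comparison of my own, not from the original posts): cross-entropy penalizes a confidently wrong prediction without bound, whereas the squared error is capped at 1, so MSE barely distinguishes "wrong" from "confidently wrong".

```python
import numpy as np

p = np.array([0.99, 0.60, 0.01])  # predicted probability for the true class
ce = -np.log(p)                   # cross-entropy per sample
se = (1 - p) ** 2                 # squared error per sample

for pi, c, s in zip(p, ce, se):
    print(f"p={pi:.2f}  cross-entropy={c:.2f}  squared error={s:.4f}")
# p=0.01 yields cross-entropy ~4.61, but squared error only ~0.98
```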


In my opinion, this is subjective and problem specific. You should use whatever is the most important factor in your mind as the driving metric, as this can better focus your decisions on how to alter the model.

Most metrics one can compute will be correlated or similar in many ways: e.g. if you use MSE for your loss, then also recording MAPE (mean absolute percentage error) or the simple $L_1$ loss will give you comparable loss curves.
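A quick toy illustration of that correlation (numbers of my own invention):

```python
import numpy as np

y_true = np.array([100.0, 50.0, 25.0, 10.0])
y_pred = np.array([ 90.0, 55.0, 20.0, 12.0])

err  = y_pred - y_true
mse  = np.mean(err ** 2)                    # what the model optimises
l1   = np.mean(np.abs(err))                 # mean absolute (L1) error
mape = np.mean(np.abs(err / y_true)) * 100  # mean absolute percentage error

print(mse, l1, mape)  # as the fit improves, all three tend to shrink together
```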

For example, if you will report an F1-score in your report/to your boss etc. (and assuming that is what they really care about), then using that metric could make the most sense. The F1-score takes precision and recall into account: it is the harmonic mean of those two more fine-grained metrics, $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.

Bringing those things together: computing scores other than the normal loss may be nice for the overview and to see how your final metric is optimised over the course of the training iterations. That relationship could perhaps give you deeper insight into the problem.

It is usually best to try several options, however, as optimising for the validation loss may allow training to run for longer, which eventually may also produce a superior F1-score. Precision and recall might sway around some local minima, producing an almost static F1-score, so you would stop training. If you had been optimising for pure loss, you might have recorded enough fluctuation in the loss to allow you to train for longer.
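If you want to compare both stopping criteria in practice, Keras' EarlyStopping can monitor any quantity that appears in the logs. Here is a hedged sketch (x_val and y_val are placeholder names) of a custom callback that logs a validation F1-score each epoch so it can be monitored:

```python
from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import Callback, EarlyStopping

class ValF1(Callback):
    """Compute the F1-score on a held-out set at each epoch end
    and expose it to later callbacks under the key 'val_f1'."""

    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        y_prob = self.model.predict(self.x_val, verbose=0)
        y_pred = (y_prob > 0.5).astype(int).ravel()
        if logs is not None:
            logs["val_f1"] = f1_score(self.y_val, y_pred)

# Option A: stop on the F1-score (place ValF1 before EarlyStopping in the list)
# callbacks=[ValF1(x_val, y_val),
#            EarlyStopping(monitor="val_f1", mode="max", patience=10)]
# Option B: stop on the validation loss
# callbacks=[EarlyStopping(monitor="val_loss", mode="min", patience=10)]
```

Running both variants and comparing the final F1-scores is the cheapest way to settle the question for your specific problem.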

n1k31t4

Posted 2018-08-20T12:22:25.053

Reputation: 12 573

Why would using the validation loss allow for training longer than using a metric? Also, can you elaborate on the difference between the two options? Do you see a case where it would be a bad idea to use a metric rather than the loss? – qmeeus – 2018-08-20T14:43:42.527

@id-2205 - please see my edited answer. – n1k31t4 – 2018-08-20T17:00:03.280

Interesting point! I am currently using accuracy for early stopping but I will try to use the validation loss. I don't expect any changes in the training process though! Thank you for your answer – qmeeus – 2018-08-21T19:30:42.983


I am currently training a neural network and I cannot decide which to use to implement my early-stopping criterion: validation loss or a metric like accuracy/F1-score/AUC/whatever calculated on the validation set.

If you are training a deep network, I highly recommend that you not use early stopping. In deep learning, it is not very customary. Instead, you can employ other techniques, like dropout, to generalize well. If you insist on it, the choice of criterion depends on your task. If you have imbalanced data, you should use the F1-score and evaluate it on your cross-validation data. If you have balanced data, use accuracy on your cross-validation data. Other techniques depend highly on your task.

I highly encourage you to find a model that fits your data very well and to employ dropout after that. This is the most customary approach people use for deep models.
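For completeness, a minimal sketch of the approach this answer recommends (the architecture and the 0.5 dropout rate are purely illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Hypothetical binary classifier with dropout after each hidden layer
model = Sequential([
    Dense(128, activation="relu", input_shape=(20,)),
    Dropout(0.5),   # randomly zeroes 50% of activations during training
    Dense(64, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```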

Media

Posted 2018-08-20T12:22:25.053

Reputation: 12 077

I am using dropout as well. However, I can't find a reason why early stopping should not be used... – qmeeus – 2018-08-21T19:21:22.957

Early stopping tries to solve both the learning and the generalization problems, whereas dropout just tries to overcome the generalization problem. – Media – 2018-08-21T19:35:58.747

You don't answer my question... I don't deny that dropout is useful and should be used to protect against overfitting; I couldn't agree more on that. My question is: why do you say that early stopping should not be used with ANNs? (cf. your first sentence: "If you are training a deep network, I highly recommend that you not use early stopping.") – qmeeus – 2018-08-21T20:22:41.167

Did you read my last comment? It exactly answers your question. It's a famous quote from Prof. Ng in his deep learning class, second course. The latter case is an easier task because it does not struggle to solve multiple tasks simultaneously. – Media – 2018-08-21T21:43:44.363

Interesting! In my experience, I have found early stopping quite useful because you don't lose time and resources on models that do not generalize. Even using dropout does not make a poorly configured model generalize better. When doing some kind of automated grid search, it is often useful to have a mechanism to stop the training if necessary, without human intervention. – qmeeus – 2018-08-22T07:35:11.747

As I said in the answer, I highly encourage you to find a model that fits your data very well and to employ dropout after that. – Media – 2018-08-22T08:23:10.047

And in order to find it and the right set of hyperparameters, I'm employing some kind of directed grid search with early stopping, for the reasons I explained above. Point taken, though: once I have selected the final model and train it, I will not use early stopping. Thank you for this interesting discussion and for your advice – qmeeus – 2018-08-22T11:37:42.243

Early stopping is an ugly way to solve overfitting. It's ugly because you are very dependent on the luck of your initialization and learning algorithm; besides, if you change any detail about those, the impact will be huge. If overfitting is a problem, it is wiser to just reduce the size of the neural network or use other regularization techniques. – Ricardo Cruz – 2018-08-23T21:26:19.207

@RicardoCruz I do agree sir. – Media – 2018-08-24T13:28:57.930

From Deep Learning (Goodfellow et al., available at http://www.deeplearningbook.org), section 11.2: "(...) you should include some mild forms of regularization from the start. Early stopping should be used almost universally. Dropout is an excellent regularizer that is easy to implement (...). Batch normalization also sometimes reduces generalization error (...)." In any case, early stopping or not, how good a model is must not depend on a lucky initialisation. However, choosing the right initialisation method (see Glorot) and learning algorithm does play a role.

– qmeeus – 2018-09-19T13:30:00.410

It highly depends on your task. As for including "some mild forms of regularization from the start": at least based on my experience, I prefer not to do that for regression tasks where there are numerous real-valued outputs. – Media – 2018-09-21T18:24:53.283