Orange is the training loss, blue is the dev loss. As you can see, the training loss is lower than the dev loss, so I figured I have (reasonably) low bias and high variance, which means I'm overfitting and should add some regularization: dropout, L2 regularization, and data augmentation (see the sketch below for roughly what I added). After that, I get a plot like this:
Now we see that the variance has decreased and the bias has increased, so the model is overfitting less. Is this correct? However, I would actually select the first model because it has a lower validation loss.
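For context, the regularization I added looks roughly like this. This is a minimal PyTorch sketch, not my actual architecture; the layer sizes, dropout rate, weight-decay value, and transforms are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Placeholder model: dropout applied before the output layer
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),                 # dropout
    nn.Linear(256, 10),
)

# L2 regularization applied through the optimizer's weight_decay term
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Data augmentation (example transforms for an image task)
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])
```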
My question is: in most of the literature on the bias-variance tradeoff, the validation loss is shown going back up, but in my experiments this is not the case, so are these models actually overfitting? Are you overfitting as soon as the training loss drops below the validation loss, or only once the validation loss starts rising again? And is it okay to choose a model with higher variance if its validation loss is lower?
I found this answer to a similar question, but what if your problem is so complex that you can't find an architecture that can first overfit and then be properly regularized? I can find an architecture that gets the training loss close(r) to zero, but then I can't add enough dropout to bring the variance down, and if I add augmentation, my validation loss goes up as well. Finally, the answer confuses me: the answerer talks about variance on the training set, but isn't bias always related to the training loss and variance to the dev loss?
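For reference, the decomposition I keep seeing in the literature is the one for squared error, with expectations taken over draws of the training set, which is what makes me unsure how it maps onto train/dev curves:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$

where $\hat{f}$ is the model fit on a random training set and $\sigma^2$ is the irreducible noise.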
Or am I just misinterpreting things, and should I plot the losses as a function of dataset size rather than number of epochs to find out whether I am overfitting?
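In case it helps clarify what I mean by that last plot, I'm thinking of something like scikit-learn's `learning_curve`, here with a stand-in `MLPClassifier` and synthetic data rather than my real model or dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

# Placeholder data, just to make the snippet runnable
X = np.random.randn(2000, 20)
y = (X[:, 0] + 0.5 * np.random.randn(2000) > 0).astype(int)

# Retrain on increasing subset sizes and record train/validation loss
train_sizes, train_scores, val_scores = learning_curve(
    MLPClassifier(max_iter=300),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="neg_log_loss",            # plot loss rather than accuracy
)

plt.plot(train_sizes, -train_scores.mean(axis=1), label="training loss")
plt.plot(train_sizes, -val_scores.mean(axis=1), label="validation loss")
plt.xlabel("training set size")
plt.ylabel("log loss")
plt.legend()
plt.show()
```

That is, retraining on increasing subsets of the data and comparing training vs. validation loss at each size, instead of tracking them over epochs.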