I have been using neural networks for a while now. However, one thing I constantly struggle with is selecting an optimizer for training the network (using backprop). What I usually do is start with one (e.g. standard SGD) and then try others pretty much at random. I was wondering if there's a better (and less random) approach to finding a good optimizer, e.g. from this list:
- SGD (with or without momentum)
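
For concreteness, here is a minimal sketch of the update rule I mean by SGD with or without momentum, written in plain Python on a toy one-dimensional quadratic. The helper name `sgd_momentum`, the learning rate, the step count, and the toy loss are all illustrative assumptions, not taken from any particular library.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, momentum=0.0, steps=100):
    """Minimise a scalar function with (momentum) SGD; hypothetical helper."""
    w = w0
    v = 0.0  # velocity; with momentum == 0 this reduces to plain SGD
    for _ in range(steps):
        g = grad_fn(w)             # gradient at the current point
        v = momentum * v - lr * g  # decay old velocity, add the new step
        w = w + v                  # move along the velocity
    return w


# Toy loss f(w) = (w - 3)^2 with gradient 2 * (w - 3); assumed for illustration.
def grad(w):
    return 2.0 * (w - 3.0)


w_plain = sgd_momentum(grad, w0=0.0, momentum=0.0)
w_mom = sgd_momentum(grad, w0=0.0, momentum=0.9)
```

On this toy problem both variants head toward the minimum at `w = 3`; the momentum variant oscillates around it before settling, which is the kind of behavioural difference I would like to be able to reason about in advance.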
In particular, I am interested in whether there is some theoretical justification for picking one over another given that the training data has some property, e.g. being sparse. I would also imagine that some optimizers work better than others in specific domains, e.g. when training convolutional networks vs. feed-forward networks, or for classification vs. regression.
If any of you have developed some strategy and/or intuition on how you pick optimizers, I'd be greatly interested in hearing it. Furthermore, if there's some work that provides theoretical justification for picking one over another, that would be even better.