I'm currently implementing Stochastic Gradient Descent (SGD) for neural nets trained with back-propagation. While I understand its purpose, I have some questions about how to choose a value for the learning rate.
- Is the learning rate related to the shape of the error gradient, as it dictates the rate of descent?
- If so, how do you use this information to inform your decision about a value?
- If it's not, what sort of values should I choose, and how should I choose them?
- It seems like you would want small values to avoid overshooting, but how do you choose one such that you don't get stuck in local minima or take too long to descend?
- Does it make sense to use a constant learning rate, or should I use some metric to shrink it as I approach a minimum of the gradient?
In short: How do I choose the learning rate for SGD?