I'm currently working on implementing Stochastic Gradient Descent (`SGD`) for neural nets using back-propagation, and while I understand its purpose, I have some questions about how to choose values for the learning rate.

- Is the learning rate related to the shape of the error gradient, as it dictates the rate of descent?
- If so, how do you use this information to inform your decision about a value?
- If it's not what sort of values should I choose, and how should I choose them?
- It seems like you would want small values to avoid overshooting, but how do you choose one such that you don't get stuck in local minima or take too long to descend?
- Does it make sense to have a constant learning rate, or should I use some metric to alter its value as I get nearer a minimum in the gradient?

In short: How do I choose the learning rate for SGD?
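To make the last bullet concrete, here is a minimal sketch of gradient descent with a step-decay learning-rate schedule on a toy one-dimensional objective. The objective, the schedule, and all constants (`initial_lr`, `decay`, `step_size`) are illustrative assumptions, not recommended values.

```python
import random

# Toy objective: f(w) = (w - 3)^2, minimized at w = 3.
def grad(w):
    # derivative of (w - 3)^2
    return 2.0 * (w - 3.0)

def train(initial_lr=0.1, decay=0.5, step_size=20, n_steps=100):
    w = random.uniform(-10.0, 10.0)  # random starting point
    for t in range(n_steps):
        # Step decay: halve the learning rate every `step_size` iterations,
        # so early steps move fast and later steps refine near the minimum.
        lr = initial_lr * (decay ** (t // step_size))
        w -= lr * grad(w)
    return w

random.seed(0)
print(train())  # converges near the minimum at w = 3
```

The same idea carries over to real SGD: start with a learning rate large enough to make quick progress, then shrink it on a schedule (or when validation loss plateaus) so the iterates settle into a minimum instead of oscillating around it.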

In practice, you will use a learning rate with Adadelta. On some problems it does not work without. – bayer – 2014-06-29T18:18:55.930

It should be noted that the Adam optimizer is more usual than Adagrad or Adadelta these days. – E_net4 wants more flags – 2017-09-21T16:24:10.787