[Note 5 April 2019: A new version of the paper has been uploaded to arXiv with many new results. We also introduce backtracking versions of Momentum and NAG, and prove convergence under the same assumptions as for Backtracking Gradient Descent.
Source code is available on GitHub at the link.
We improved the algorithms for application to deep neural networks (DNNs) and obtained better performance than state-of-the-art algorithms such as MMT, NAG, Adam, Adamax, Adagrad, ...
The most distinctive feature of our algorithms is that they are automatic: you do not need to fine-tune learning rates manually, as is common practice. Our automatic fine-tuning is different in nature from that of Adam, Adamax, Adagrad, and so on. More details are in the paper.]
Based on very recent results: in my joint work in this paper, we showed that backtracking gradient descent, when applied to an arbitrary $C^1$ function $f$ with at most countably many critical points, will always either converge to a critical point or diverge to infinity. This condition is satisfied for a generic function, for example for all Morse functions. We also showed that, in a sense, it is very rare for the limit point to be a saddle point. So if all of your critical points are non-degenerate, then in a certain sense the limit points are all minima. [Please see also the references in the cited paper for the known results in the case of standard gradient descent.]
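For readers who have not seen backtracking gradient descent before, here is a minimal sketch, assuming the usual Armijo sufficient-decrease condition $f(x - \delta \nabla f(x)) - f(x) \le -\alpha \delta \|\nabla f(x)\|^2$. The names `backtracking_step` and `backtracking_gd`, the parameters `delta0`, `alpha`, `beta`, and the test function are illustrative choices of mine and are not taken from the paper's code.

```python
import numpy as np

def backtracking_step(f, grad, x, delta0=1.0, alpha=0.5, beta=0.5, max_shrinks=60):
    """One step: shrink the learning rate until Armijo's condition holds."""
    g = grad(x)
    fx = f(x)
    sq_norm = float(np.dot(g, g))
    delta = delta0
    for _ in range(max_shrinks):
        # Armijo: require a decrease of at least alpha * delta * ||g||^2.
        if f(x - delta * g) - fx <= -alpha * delta * sq_norm:
            break
        delta *= beta
    return x - delta * g, delta

def backtracking_gd(f, grad, x0, n_iters=200, tol=1e-6):
    """Iterate until the gradient norm drops below tol (hypothetical stopping rule)."""
    x = np.asarray(x0, dtype=float)
    rates = []
    for _ in range(n_iters):
        if np.linalg.norm(grad(x)) < tol:
            break
        x, delta = backtracking_step(f, grad, x)
        rates.append(delta)
    return x, rates

# Illustrative function with critical points at -1, 0, 1 (0 is a local maximum).
f = lambda x: float(np.sum(x**4 / 4 - x**2 / 2))
grad = lambda x: x**3 - x

x_min, rates = backtracking_gd(f, grad, [0.3])
```

Note that the learning rate is chosen afresh at every iteration, which is why no manual tuning is needed; the price is a few extra function evaluations per step.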
Based on the above, we proposed a new method in deep learning which is on par with current state-of-the-art methods and does not need manual fine-tuning of the learning rates. (In a nutshell, the idea is that you run backtracking gradient descent for a certain amount of time, until you see that the learning rates, which change with each iteration, stabilise. We expect this stabilisation, in particular near a critical point at which the function is $C^2$ and which is non-degenerate, because of the convergence result I mentioned above. At that point, you switch to the standard gradient descent method. Please see the cited paper for more detail. This method can also be applied to other optimisation algorithms.)
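The two-phase idea can be sketched as follows. This is only an illustration under my own assumptions: the stabilisation test (the last `window` learning rates are identical), the function names `armijo_rate` and `two_phase_gd`, and all default parameters are hypothetical and are not the criteria used in the paper.

```python
import numpy as np

def armijo_rate(f, g, x, delta0=1.0, alpha=0.5, beta=0.5, max_shrinks=60):
    """Largest delta0 * beta**k (k <= max_shrinks) satisfying Armijo's condition."""
    fx = f(x)
    sq_norm = float(np.dot(g, g))
    delta = delta0
    for _ in range(max_shrinks):
        if f(x - delta * g) - fx <= -alpha * delta * sq_norm:
            break
        delta *= beta
    return delta

def two_phase_gd(f, grad, x0, window=5, max_backtracking=100, n_standard=200):
    x = np.asarray(x0, dtype=float)
    rates = []
    # Phase 1: backtracking gradient descent until the rates stabilise.
    for _ in range(max_backtracking):
        g = grad(x)
        delta = armijo_rate(f, g, x)
        x = x - delta * g
        rates.append(delta)
        # Assumed criterion: the last `window` rates are all the same.
        if len(rates) >= window and len(set(rates[-window:])) == 1:
            break
    # Phase 2: standard gradient descent with the last rate found in phase 1.
    fixed_rate = rates[-1]
    for _ in range(n_standard):
        x = x - fixed_rate * grad(x)
    return x, fixed_rate, len(rates)

# Illustrative quadratic f(x) = 1.5 * ||x||^2, whose gradient is 3x.
f = lambda x: 1.5 * float(np.dot(x, x))
grad = lambda x: 3.0 * x

x_final, fixed_rate, n_backtracking = two_phase_gd(f, grad, [2.0, -1.0])
```

On this simple quadratic the backtracking phase settles on the same rate at every step, so the switch happens as soon as the window is full. On harder functions the rates may keep changing for a while, and a more forgiving test would be needed.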
P.S. Regarding your original question about the standard gradient descent method: to my knowledge, it is proven to converge only in the case where the derivative of the map is globally Lipschitz and the learning rate is small enough. [If these conditions are not satisfied, there are simple counter-examples showing that no convergence result is possible; see the cited paper for some.] In the paper cited above, we argued that in the long run the backtracking gradient descent method becomes the standard gradient descent method, which gives an explanation of why the standard gradient descent method usually works well in practice.
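To make the role of the two conditions concrete, here is a small one-dimensional sketch. The examples are textbook illustrations of my own choosing and are not the counter-examples from the cited paper. For $f(x) = \frac{L}{2}x^2$ the gradient is globally Lipschitz with constant $L$, and a fixed learning rate works only below $2/L$; for $f(x) = x^4/4$ the gradient $x^3$ is not globally Lipschitz, and a fixed rate can diverge if the starting point is far enough from the minimum.

```python
def gradient_descent(grad, x0, rate, n_iters):
    """Standard gradient descent with a fixed learning rate (1-D for simplicity)."""
    x = float(x0)
    for _ in range(n_iters):
        x = x - rate * grad(x)
    return x

L = 4.0
quad_grad = lambda x: L * x        # gradient of f(x) = 0.5 * L * x**2

# Rate below 2/L: each step multiplies x by about 0.1, so x shrinks to 0.
x_small_rate = gradient_descent(quad_grad, 1.0, rate=0.9 / L, n_iters=100)

# Rate above 2/L: each step multiplies x by -1.5, so |x| blows up.
x_large_rate = gradient_descent(quad_grad, 1.0, rate=2.5 / L, n_iters=100)

quartic_grad = lambda x: x**3      # gradient of f(x) = x**4 / 4, not globally Lipschitz

# The same fixed rate that is harmless near 0 diverges when started at x0 = 5.
x_quartic = gradient_descent(quartic_grad, 5.0, rate=0.1, n_iters=5)
```

Backtracking avoids both failure modes because it shrinks the rate whenever a step would not decrease the function enough.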