Why not always use the ADAM optimization technique?



It seems the Adaptive Moment Estimation (Adam) optimizer nearly always works better (converging faster and more reliably to a good minimum) when minimising the cost function during neural-net training.

Why not always use Adam? Why even bother using RMSProp or momentum optimizers?
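
For concreteness, here is a minimal sketch of what this choice looks like in code (assuming PyTorch; the toy model and learning rates are hypothetical placeholders), since the optimizers in question are drop-in replacements for one another:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)  # hypothetical toy model

# The optimizers being compared are interchangeable at this call site:
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3)
opt_momentum = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

def train_step(optimizer, x, y):
    # Identical training step regardless of which optimizer is passed in.
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

x, y = torch.randn(32, 10), torch.randn(32, 1)
print(train_step(opt_adam, x, y))
```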



I don't believe there is any strict, formalized way to support either statement. It's all purely empirical, as the error surface is unknown. As a rule of thumb, and purely from my experience, Adam does well where others fail (e.g., instance segmentation), although not without drawbacks (convergence is not monotone). – Alex – 2018-05-08T08:53:14.933

Adam is faster to converge. SGD is slower but generalizes better. So in the end it all depends on your particular circumstances. – agcala – 2019-03-21T12:10:21.353



Here’s a blog post reviewing an article claiming that SGD generalizes better than Adam.

There is often value in using more than one method (an ensemble), because every method has weaknesses.
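
To illustrate that idea, a rough sketch (assuming PyTorch; the toy data, models, and hyperparameters are made up for the example) of an ensemble whose members were trained with different optimizers and whose predictions are averaged:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train(optimizer_cls, **opt_kwargs):
    # Hypothetical toy task: fit y = 3x + noise with a linear model.
    torch.manual_seed(0)
    x = torch.randn(256, 1)
    y = 3 * x + 0.1 * torch.randn(256, 1)
    model = nn.Linear(1, 1)
    opt = optimizer_cls(model.parameters(), **opt_kwargs)
    for _ in range(200):
        opt.zero_grad()
        loss = F.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return model

# One ensemble member per optimizer; predictions are simply averaged.
members = [
    train(torch.optim.Adam, lr=1e-2),
    train(torch.optim.SGD, lr=1e-2, momentum=0.9),
]
x_test = torch.randn(8, 1)
with torch.no_grad():
    ensemble_pred = torch.stack([m(x_test) for m in members]).mean(dim=0)
print(ensemble_pred)
```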

Christopher Klaus



You should also take a look at this post comparing different gradient descent optimizers. As that comparison shows, Adam is clearly not the best optimizer for some tasks, since several other methods converge better.
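
Along the same lines, here is a small sketch (assuming PyTorch, an arbitrary toy quadratic objective, and arbitrary learning rates) that records each optimizer's loss curve so their convergence can be compared directly:

```python
import torch

def loss_curve(optimizer_cls, steps=100, **opt_kwargs):
    # Toy objective: drive a 2-D parameter vector to zero.
    w = torch.tensor([5.0, -3.0], requires_grad=True)
    opt = optimizer_cls([w], **opt_kwargs)
    curve = []
    for _ in range(steps):
        opt.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        opt.step()
        curve.append(loss.item())
    return curve

curves = {
    "SGD+momentum": loss_curve(torch.optim.SGD, lr=0.05, momentum=0.9),
    "RMSprop": loss_curve(torch.optim.RMSprop, lr=0.05),
    "Adam": loss_curve(torch.optim.Adam, lr=0.05),
}
for name, curve in curves.items():
    print(name, curve[-1])  # final loss after the same number of steps
```

On such a toy surface the ranking can easily flip with different learning rates, which is part of the point of the linked comparison.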




Just for the record: in the linked article they mention some of the flaws of Adam and present AMSGrad as a solution. However, they conclude that whether AMSGrad outperforms Adam in practice was (at the time of writing) inconclusive. – Lus – 2019-09-19T11:24:02.573
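
For reference, the AMSGrad variant mentioned in that comment is commonly exposed as an option on Adam itself; in PyTorch, for example, it is a flag on the Adam optimizer (the model and learning rate below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model

# AMSGrad replaces Adam's denominator with a running maximum of the
# second-moment estimate instead of the current moving average.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```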