Why not always use the ADAM optimization technique?

35

12

It seems the Adaptive Moment Estimation (Adam) optimizer nearly always works better (reaching a global minimum faster and more reliably) when minimising the cost function while training neural nets.

Why not always use Adam? Why even bother using RMSProp or momentum optimizers?
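
For concreteness, a minimal, hypothetical sketch of the kind of side-by-side comparison this question implies, assuming a PyTorch setup and a made-up toy regression task: the same small network is trained once with each optimizer so their convergence can be checked empirically.

```python
# Hypothetical benchmark: the same small network trained with Adam, RMSProp,
# and SGD with momentum on a toy regression task.
import torch
import torch.nn as nn

# Toy data: linear targets with a little noise (purely illustrative).
torch.manual_seed(0)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(512, 1)

optimizers = {
    "adam":    lambda p: torch.optim.Adam(p, lr=1e-3),
    "rmsprop": lambda p: torch.optim.RMSprop(p, lr=1e-3),
    "sgd+mom": lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9),
}

loss_fn = nn.MSELoss()
for name, make_opt in optimizers.items():
    torch.manual_seed(0)  # identical initial weights for a fair comparison
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = make_opt(model.parameters())
    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    print(f"{name}: final training loss = {loss.item():.4f}")
```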

PyRsquared

Posted 2018-04-15T16:55:34.020

Reputation: 1 316

2 I don't believe there is any strict, formalized way to support either statement. It's all purely empirical, as the error surface is unknown. As a rule of thumb, and purely from my experience, ADAM does well where others fail (e.g. instance segmentation), although not without drawbacks (convergence is not monotone). – Alex – 2018-05-08T08:53:14.933

4 Adam converges faster. SGD is slower but generalizes better. So in the end it all depends on your particular circumstances. – agcala – 2019-03-21T12:10:21.353

Answers

32

Here’s a blog post reviewing an article claiming that SGD generalizes better than ADAM.

There is often value in using more than one method (an ensemble), because every method has its weaknesses.
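
As an illustration of that ensembling idea, here is a hypothetical sketch (assuming PyTorch and a toy task of my own making): one copy of the model is trained per optimizer, and the members' predictions are averaged at inference time.

```python
# Train one model copy per optimizer, then average their predictions.
import torch
import torch.nn as nn

def train_copy(make_opt, X, y, steps=200):
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = make_opt(model.parameters())
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

# Toy data, purely for illustration.
X = torch.randn(256, 10)
y = X.sum(dim=1, keepdim=True)

members = [
    train_copy(lambda p: torch.optim.Adam(p, lr=1e-3), X, y),
    train_copy(lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9), X, y),
]

# Ensemble prediction: the mean of the member outputs.
with torch.no_grad():
    x_new = torch.randn(5, 10)
    prediction = torch.stack([m(x_new) for m in members]).mean(dim=0)
print(prediction.squeeze())
```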

Christopher Klaus

Posted 2018-04-15T16:55:34.020

Reputation: 446

13

You should also take a look at this post comparing different gradient descent optimizers. As the comparisons there show, Adam is clearly not the best optimizer for every task; on some problems, other methods converge better.
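
For reference, these are the standard textbook update rules being compared, in the usual notation (parameters $\theta$, gradient $g_t$, learning rate $\eta$): momentum accumulates a velocity, RMSProp rescales the step by a running average of squared gradients, and Adam combines both ideas with bias correction.

$$v_t = \gamma v_{t-1} + \eta g_t, \qquad \theta_{t+1} = \theta_t - v_t \quad \text{(momentum)}$$

$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t \quad \text{(RMSProp)}$$

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \quad \hat m_t = \frac{m_t}{1-\beta_1^t}, \quad \hat v_t = \frac{v_t}{1-\beta_2^t}, \quad \theta_{t+1} = \theta_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \epsilon} \quad \text{(Adam)}$$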

user50386

Posted 2018-04-15T16:55:34.020

Reputation:

2 Just for the record: in the linked article they mention some of the flaws of ADAM and present AMSGrad as a solution. However, they conclude that whether AMSGrad outperforms ADAM in practice is (at the time of writing) inconclusive. – Lus – 2019-09-19T11:24:02.573
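
As a small practical note on that comment: PyTorch's Adam implementation exposes AMSGrad as a flag, so trying the variant is a one-line change. The tiny model below is purely hypothetical, just to keep the snippet self-contained.

```python
import torch

# Hypothetical model so the snippet runs on its own.
model = torch.nn.Linear(10, 1)

# Plain Adam vs. the AMSGrad variant: only the flag differs.
adam    = torch.optim.Adam(model.parameters(), lr=1e-3)
amsgrad = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```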