## How to prove that gradient descent doesn't necessarily find the global optimum?

1

How can I prove that gradient descent doesn't necessarily find the global optimum?

For example, consider the following function

$$f(x_1, x_2, x_3, x_4) = (x_1 + 10x_2)^2 + 5x_2^3 + (x_2 + 2x_3)^4 + 3x_1x_4^2$$

Assume also that we can't find the optimal value for the learning rate because of time constraints.

1

Well, GD terminates once the gradients are 0, right? Now, in a non-convex function, there could be some points, which do not belong to the global minima, and yet, have 0 gradients. For example, such points can belong to saddle points and local minima.

Consider this picture and say you start GD at the x label.

GD will bring you the flat area and will stop making progress there as gradients are 0. However, as you can see, global minima is to the left of this flat region.

By the same token, you have to show, for your own function, that there exists at least a single point whose gradients are 0 and yet, it is not the global minima.

In addition to that, the guarantee on converge for convex functions depends on annealing the learning rate appropriately. For example, if your LR is too high, GD can just keep overshooting the minima. The visualization from this page might help you to understand more regarding the behavior of GD.

In convex function with an unknown learning rate value (which is not optimal value of that as mentioned) how to ensure that it converges or not? – Mostafa Ghadimi – 2020-03-17T11:44:53.253

If the function is convex, GD converges with a decaying learning rate. I guess if you don't make any assumptions on the learning rate, it could keep overshooting the global minima. Say you're 1 units left of minima but due to big LR, you could go like 3 units right of minima, then back to 1 units left of minima and so on. – SpiderRico – 2020-03-17T11:47:58.487

how to find out whether LR has decaying rate or not? Is there any way to find out the oscillation? – Mostafa Ghadimi – 2020-03-17T23:05:43.903

well in your question, are you allowed to set the lr? if not, you have to look for auxiliary information. for example, if your training loss/function value jumps back and forth, this probably indicates that GD is overshooting its target due to big lr. – SpiderRico – 2020-03-17T23:11:37.863

I'm allowed but can't find the optimal value for lr due to time limit. – Mostafa Ghadimi – 2020-03-17T23:26:10.330

okay then you have to monitor the function value and decrease the learning rate when it starts to overshoot. also look into lr scheduling methods such as line search in general. – SpiderRico – 2020-03-17T23:29:08.150

Sure. I've added a link that visualizes how GD behaves on loss surfaces. You might find it useful. – SpiderRico – 2020-03-18T16:21:01.687

2

You can find by yourself a counterexample that, in general, GD is not guaranteed to find the global optimum!

I first advise you to choose a simpler function (than the one you are showing), with 2-3 optima, where one is the global and the other(s) are local. You don't need neural networks or any other ML concept to show this, but only basic calculus (derivatives) and numerical methods (i.e. gradient descent). Just choose a very simple function with more than one optimum and apply the basic gradient descent algorithm. Then you can see that, if you start gradient descent close to one local optimum (i.e. you choose an initial value for $$x$$ or $$\theta$$, depending on your notation for the variable of the function) and then you apply gradient descent (for some iterations), you will end up in that close local optimum, from which you cannot escape, after having applied the gradient descent steps.