51

24

What is the difference between Gradient Descent and Stochastic Gradient Descent?

I am not very familiar with these, can you describe the difference with a short example?


55

For a quick simple explanation:

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

In GD, you have to run through ALL the samples in your training set to do a single update for a parameter in a particular iteration. In SGD, on the other hand, you use ONLY ONE training sample, or a SUBSET of them, to do the update for a parameter in a particular iteration. If you use a SUBSET, it is called Minibatch Stochastic Gradient Descent.

Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration, when you are updating the values of the parameters, you are running through the complete training set. On the other hand, using SGD will be faster, because you use only one training sample and it starts improving itself right away from the first sample.

SGD often converges much faster compared to GD, but the error function is not as well minimized as in the case of GD. In most cases, though, the close approximation that you get in SGD for the parameter values is enough, because they reach near-optimal values and keep oscillating there.
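The contrast above can be sketched in a few lines of NumPy. The toy regression data, learning rate, and iteration counts below are made up for illustration, not part of the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = 3x + noise.
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)

def gradient(w, Xb, yb):
    # Gradient of the mean squared error for a linear model Xb @ w.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

alpha = 0.1

# Batch gradient descent: every update runs through ALL samples.
w_gd = np.zeros(1)
for _ in range(100):
    w_gd = w_gd - alpha * gradient(w_gd, X, y)

# Stochastic gradient descent: each update uses only ONE random sample.
w_sgd = np.zeros(1)
for _ in range(1000):
    i = rng.integers(len(X))
    w_sgd = w_sgd - alpha * gradient(w_sgd, X[i:i+1], y[i:i+1])

print(w_gd, w_sgd)  # both approach the true slope of 3
```

Both estimates end up near 3, but the SGD one keeps oscillating around it, which is exactly the behaviour described above.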

If you need an example of this with a practical case, check Andrew Ng's notes here, where he clearly shows you the steps involved in both cases: cs229-notes

Source: Quora Thread

8

The inclusion of the word **stochastic** simply means that *random* samples from the training data are chosen in each run to update the parameters during optimisation, within the framework of **gradient descent**.

Doing so not only computes errors and updates weights in faster iterations (because we only process a small selection of samples in one go), it also often helps us move towards an optimum more quickly. Have a look at the answers here for more information on why using stochastic minibatches for training offers advantages.

One possible downside is that the path to the optimum (assuming it would always be the same optimum) can be much noisier. So instead of a nice smooth loss curve, showing how the error decreases in each iteration of gradient descent, you might see something like this:

We clearly see the loss decreasing over time; however, there are large variations from epoch to epoch (training batch to training batch), so the curve is noisy.

This is simply because we compute the mean error over our stochastically/randomly selected subset, from the entire dataset, in each iteration. Some samples will produce high error, some low. So the average can vary, depending on which samples we randomly used for one iteration of gradient descent.
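As a tiny illustration of that last point (the per-sample errors below are purely synthetic, not from any real model), the mean over a random minibatch scatters around the full-dataset mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-sample errors for some fixed model state.
errors = rng.exponential(scale=1.0, size=10_000)

full_mean = errors.mean()

# Mean error over a few random minibatches of 32 samples each.
batch_means = [errors[rng.choice(errors.size, 32, replace=False)].mean()
               for _ in range(5)]

print(full_mean)
print(batch_means)  # fluctuates around full_mean from batch to batch
```

Each minibatch mean is an unbiased but noisy estimate of the full mean, which is why the loss curve jumps around rather than descending smoothly.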

thanks,

Briefly like this?

There are three variants of the Gradient Descent: Batch, Stochastic and Minibatch:

Batch updates the weights after all training samples have been evaluated.

In Stochastic, weights are updated after each training sample.

The Minibatch combines the best of both worlds. We do not use the full data set, nor a single data point; we use a randomly selected subset of our data set. In this way, we reduce the calculation cost and achieve lower variance than the stochastic version. – Developer – 2018-08-07T15:51:01.880

I'd say there is batch, where a batch is the entire training set (so basically one epoch), then there is mini-batch, where a subset is used (so any number less than the entire set $N$) - this subset is chosen at random, so it is stochastic. Using a single sample would be referred to as *online learning*, and is a special case of mini-batch... or simply mini-batch with `n=1`. – n1k31t4 – 2018-08-07T19:34:02.193

tks, this is clear! – datdinhquoc – 2019-09-28T02:20:58.640

6

In Gradient Descent, or Batch Gradient Descent, we use the whole training data for each parameter update, whereas in Stochastic Gradient Descent we use only a single training example per update. Mini-batch Gradient Descent lies in between these two extremes: we use a mini-batch (a small portion) of the training data per update. The rule of thumb for selecting the mini-batch size is a power of 2, like 32, 64, 128, etc.

For more details: cs231n lecture notes
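A minimal sketch of the mini-batch variant (the toy data, learning rate, and epoch count below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 3x + noise.
X = rng.normal(size=(512, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=512)

w = np.zeros(1)
alpha = 0.1
batch_size = 32  # a power of 2, per the rule of thumb above

for epoch in range(20):
    perm = rng.permutation(len(X))  # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of mean squared error over this mini-batch only.
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
        w = w - alpha * grad

print(w)  # close to the true slope of 3
```

Each epoch here makes `len(X) / batch_size` updates, so the parameters improve many times per pass over the data instead of once.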


4

**Gradient Descent** is an algorithm to minimize $J(\Theta)$.

**Idea:** For the current value of $\Theta$, calculate the gradient of $J(\Theta)$, then take a small step in the direction of the negative gradient. Repeat.

Algorithm:

```
while True:
    # Gradient of J over the ENTIRE corpus -- one full pass per update.
    theta_grad = evaluate_gradient(J, corpus, theta)
    theta = theta - alpha * theta_grad
```

But the problem is that $J(\Theta)$ is a function of all the windows in the corpus, so it is very expensive to compute.

**Stochastic Gradient Descent** repeatedly samples a window and updates after each one.

Stochastic Gradient Descent Algorithm:

```
while True:
    window = sample_window(corpus)  # pick one window at random
    # Gradient of J over just this window -- cheap to compute.
    theta_grad = evaluate_gradient(J, window, theta)
    theta = theta - alpha * theta_grad
```

Usually the sample window size is a power of 2, say 32 or 64, as a mini-batch.

0

Both algorithms are quite similar. The only difference comes while iterating. In Gradient Descent, we consider all the points in calculating the loss and its derivative, while in Stochastic Gradient Descent, we use a single, randomly chosen point in the loss function and its derivative. Check out these two articles, both are inter-related and well explained. I hope it helps.


Note that the above link to cs229-notes is down. However, Wayback Machine, aligned with date of post, delivers - yay! https://web.archive.org/web/20180618211933/http://cs229.stanford.edu/notes/cs229-notes1.pdf

– Eric Cousineau – 2021-01-17T18:17:51.720