I can’t figure out which variables the lecturer is talking about. Are they the weights and biases, or the softmax function and labels in the loss function?

In the case of neural networks, the loss function depends on the weights and the biases (which are often not mentioned separately; they are also weights).

The loss function itself is heavily parametrized. Its parameters are the data (inputs and labels). Its variables (with respect to which the derivatives are taken) are the weights.
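A minimal sketch of this distinction, using a made-up one-weight linear model with mean squared error (the specific model and data are illustrative assumptions, not from the lecture):

```python
import numpy as np

# The data (x, y) are fixed *parameters* of the loss;
# the weight w is the *variable* we differentiate with respect to.
x = np.array([1.0, 2.0, 3.0])   # inputs  (fixed)
y = np.array([2.0, 4.0, 6.0])   # labels  (fixed)

def loss(w):
    # mean squared error of the model prediction w * x
    return np.mean((w * x - y) ** 2)

def dloss_dw(w):
    # derivative of the loss with respect to the weight w
    return np.mean(2.0 * (w * x - y) * x)
```

Here `loss` is a function of `w` alone; changing the dataset would give a different loss function, but the derivative is always taken with respect to the weight.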

And how would zero mean and equal variance help in optimization?

Have a look at the derivative of the sigmoid function. It is largest at 0, which means the gradient there can be large. This helps with learning, because the basic learning rule is

$$w_i \gets w_i + \Delta w_i \quad \text{with} \quad \Delta w_i = - \eta \frac{\partial E}{\partial w_i}$$

So you can get bigger adjustments if you normalize the inputs to have a mean of 0.
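You can verify the claim about the sigmoid's derivative directly: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ peaks at $z = 0$ with value $0.25$ and shrinks toward 0 in both directions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is maximal at z = 0 (value 0.25) and tiny far from 0,
# which is why inputs centered around 0 give larger gradients.
```

So pre-activations near 0 keep $\partial E / \partial w_i$, and hence $\Delta w_i$, from vanishing.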

The part about the variance ... hm, that's harder to explain. I'm not entirely sure about it. One thought is that you want the data to lie in a restricted, similar domain (independent of the application), so that you can treat results independently of your application. It might also help keep mini-batches from varying too much.

In the video the lecturer talks about initializing weights randomly using a Gaussian distribution. I cannot understand how we can initialize weights using a Gaussian distribution with zero mean and standard deviation sigma?

I'm not sure what exactly the question is.

- Practically: Just use a function. For example,
`numpy.random.normal`

if you're using Python.
- Mathematically: This is called
*sampling*. You have a process which has a random distribution and you take examples from that.
- Technically: If you really want to implement this yourself, you will most likely end up using a random number generator which generates samples from a uniform distribution in [0, 1].

Once you have those sample numbers, you simply assign those numbers to the weights.
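Putting the practical option into code, it's a one-liner with NumPy (the value of `sigma` and the weight-matrix shape here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
sigma = 0.1                     # assumed standard deviation; pick your own

# Sample a weight matrix from a Gaussian with mean 0 and std sigma.
W = rng.normal(loc=0.0, scale=sigma, size=(3, 4))

# With many samples, the empirical mean and std match the distribution.
samples = rng.normal(loc=0.0, scale=sigma, size=100_000)
```

Each entry of `W` is one sample from the distribution; assigning them to the weights is exactly the initialization the lecturer describes.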

And where will the optimizer move the point (initialized weight), up or down, to find the local minimum?

The optimizer calculates the gradient of the error function and moves the weight in the direction of the negative gradient, i.e. downhill. I recommend having a look at the early chapters of the Udacity course; I'm pretty sure this was covered there, too. Another resource would be neuralnetworksanddeeplearning.com.
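A toy sketch of that "downhill" movement, on the made-up error function $E(w) = (w - 3)^2$: whichever side of the minimum the weight starts on, the negative gradient pushes it toward the minimum.

```python
# Hypothetical example: gradient descent on E(w) = (w - 3)^2,
# whose minimum is at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)  # dE/dw

eta = 0.1   # learning rate
w = 0.0     # initialized weight, left of the minimum

for _ in range(100):
    w = w - eta * grad(w)  # step opposite the gradient, i.e. downhill
```

Starting at `w = 10.0` instead would converge to the same point from the other side; the sign of the gradient decides the direction automatically.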

So can you recommend me a book that I should study before taking this course, so that problems like these don't occur?

Tom Mitchell's book *Machine Learning* covers similar topics to the course.