Scaling the activation function


It is obvious that I have to scale the output data if its range is, say, $[-10, 10]$ while the activation function of the output layer takes values in $[-1, 1]$. But I could also scale the activation function by multiplying it by a factor of 10 instead. It seems to me that it is more common to scale the data rather than the function. Is there a reason for that?
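The two options described above can be sketched in a few lines. This is a minimal illustration with assumed ranges: targets in $[-10, 10]$ and a tanh-style activation bounded in $[-1, 1]$.

```python
# Assumed example data: targets in [-10, 10].
targets = [-10.0, -3.0, 0.0, 7.5, 10.0]

# Option 1: scale the data down into the activation's range [-1, 1] ...
scaled = [t / 10.0 for t in targets]
# ... train against `scaled`, then undo the scaling at prediction time:
restored = [s * 10.0 for s in scaled]

# Option 2: leave the data alone and scale the activation up instead,
# e.g. use 10 * tanh(x) as the output-layer activation.
```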

Anne

Posted 2020-12-18T15:46:46.293

Reputation: 11

Answers


The choice only affects the effective step size of the learning updates. Scaling the outputs down makes the gradients smaller, so the gradient descent updates are smaller. That is usually desirable: with smaller steps we are less likely to jump over a good solution.


Let's say the activation function is $a =\sigma(x)$ where $x$ is the input to the activation function and $a$ is the output.

The range of the true outputs $y$ is about 10 times the range of $\sigma$. We can either scale $\sigma$ up or scale $y$ down.

Scaling the activation function

Let's scale the activation output up by 10: $y' = 10a$.

The loss function $L$ compares predictions against the truth: $L(y', y)$. For backpropagation we need the gradient with respect to $x$, i.e. how the loss changes with the input so we can update it:

$$ \frac{\partial L(y', y)}{\partial x} = \frac{\partial L(y', y)}{\partial y'} \times \frac{\partial y'}{\partial a} \times \frac{\partial a}{\partial x} \\= \frac{\partial L(10a, y)}{\partial (10a)} \times 10 \times \frac{\partial a}{\partial x} \\= \frac{\partial L(10a, y)}{\partial a} \times \frac{\partial a}{\partial x} \\= \frac{\partial L(10 \cdot (a,\, 0.1y))}{\partial a} \times \frac{\partial a}{\partial x} $$

(In the last step, both arguments are written as $10$ times the pair $(a,\, 0.1y)$, since $10a = 10 \cdot a$ and $y = 10 \cdot 0.1y$.)

Scaling output

On the other hand, let's scale the data down instead: $y_s = 0.1y$. Then we do not need to scale $a$, and the loss gradient is:

$$ \frac{\partial L(a, y_s)}{\partial x} = \frac{\partial L(a, 0.1y)}{\partial a} \times \frac{\partial a}{\partial x} $$


Now compare the final forms of the gradient in the two cases. The only difference is in the arguments of $L$: when the data is scaled down, the arguments of the loss function are 10 times smaller, so the gradient is smaller, and so the update made to $x$ is smaller. We usually want small updates so that we can converge to a good solution.
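The difference can be checked numerically. This is a sketch under assumed choices (tanh activation, squared-error loss, arbitrary scalar values for $x$ and $y$); with squared error, scaling the activation by 10 makes the gradient exactly 100 times larger than scaling the target by 0.1.

```python
import math

# Assumed setup: tanh activation, squared-error loss,
# a scalar pre-activation x and a true target y.
x, y = 0.5, 7.0
a = math.tanh(x)       # activation output, in [-1, 1]
da_dx = 1 - a**2       # derivative of tanh w.r.t. x

# Case 1: scale the activation up, y' = 10a, loss L = (y' - y)^2
grad_scaled_act = 2 * (10*a - y) * 10 * da_dx

# Case 2: scale the target down, y_s = 0.1y, loss L = (a - y_s)^2
grad_scaled_out = 2 * (a - 0.1*y) * da_dx

print(grad_scaled_act / grad_scaled_out)  # ratio is 100 (up to rounding)
```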

Note, however, that we can also make the step size smaller simply by reducing the learning rate.
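For plain gradient descent this equivalence is easy to see: scaling the gradient by a constant factor moves $x$ by exactly the same amount as scaling the learning rate by that factor. A toy illustration with hypothetical numbers:

```python
# Hypothetical values for a single plain gradient-descent step.
x = 0.5
grad = -0.25    # some gradient value (assumed)
lr = 0.125      # learning rate (assumed)

step_scaled_grad = x - lr * (8 * grad)    # gradient 8x larger
step_scaled_lr   = x - (8 * lr) * grad    # learning rate 8x larger
print(step_scaled_grad == step_scaled_lr)  # True: x moves by the same amount
```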

So scaling the data down, instead of the activation up, is a useful rule of thumb for better convergence. It is not a hard rule.

hazrmard

Posted 2020-12-18T15:46:46.293

Reputation: 273