Why is Reconstruction in Autoencoders Using the Same Activation Function as Forward Activation, and not the Inverse?



Suppose you have an input layer with $n$ neurons and a first hidden layer with $m$ neurons, where typically $m < n$. Then you compute the activation $a_j$ of the $j$-th neuron in the hidden layer by

$a_j = f\left(\sum\limits_{i=1..n} w_{i,j} x_i+b_j\right)$, where $f$ is an activation function like $\tanh$ or $\text{sigmoid}$.

To train the network, you compute the reconstruction of the input, denoted $z$, and minimize the error between $z$ and $x$. Now, the $i$-th element in $z$ is typically computed as:

$ z_i = f\left ( \sum\limits_{j=1..m} w_{j,i}' a_j+b'_i \right) $
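For concreteness, here is a minimal NumPy sketch of these two equations. The sizes, the random initialisation, and the choice of $\tanh$ are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3                              # input and hidden sizes, m < n

x = rng.standard_normal(n)               # example input
W = 0.1 * rng.standard_normal((n, m))    # encoder weights w_{i,j}
b = np.zeros(m)                          # hidden biases b_j
W_p = 0.1 * rng.standard_normal((m, n))  # separate decoder weights w'_{j,i}
b_p = np.zeros(n)                        # input-layer biases b'_i

f = np.tanh                              # activation function

a = f(x @ W + b)                         # hidden activations a_j
z = f(a @ W_p + b_p)                     # reconstruction z_i, same f
```

Training would then minimise some error between `z` and `x`, e.g. the squared difference.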

I am wondering why the reconstruction $z$ is usually computed with the same activation function instead of the inverse function, and why separate $w'$ and $b'$ are useful instead of tied weights and biases. It seems much more intuitive to me to compute the reconstruction with the inverse activation function $f^{-1}$, e.g., $\text{arctanh}$, as follows:

$$ z_i' = \sum\limits_{j=1..m} \frac{f^{-1}(a_j)-b_j}{w_{j,i}^T} $$

Note that here tied weights are used, i.e., $w' = w^T$, and the biases $b_j$ of the hidden layer are reused, instead of introducing an additional set of biases for the input layer.

And a very related question: To visualize features, instead of computing the reconstruction, one would usually create an identity matrix with the dimension of the hidden layer. Then, one would use each column of the matrix as input to a reactivation function, which induces an output in the input neurons. For the reactivation function, would it be better to use the same activation function (resp. the $z_i$) or the inverse function (resp. the $z'_i$)?

Manfred Eppe

Posted 2016-01-12T23:39:55.800

Reputation: 101



I don't think your assumption $w' = w^T$ holds. Or rather, it is not necessary, and when it is done, it is not in order to somehow automatically reverse the calculation that created the hidden-layer features. It is not possible in general to reverse the compression from $n$ down to a smaller $m$ directly in this way. If that were the goal, you would want a form of matrix inversion, not a simple transpose.

Instead, we just want $w_{ij}$ for the compressed higher-level feature representation, and will discard $w'_{ij}$ after the autoencoder has been trained.

You can set $w' = w^T$ and tie the weights. This can act as a regulariser, helping the autoencoder generalise, but it is not necessary.
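As a sketch of the tied-weight variant (again with illustrative sizes): the decoder simply reuses the transpose of the encoder matrix, which roughly halves the number of weight parameters and so acts as a constraint on the model:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
x = rng.standard_normal(n)
W = 0.1 * rng.standard_normal((n, m))  # the single, shared weight matrix
b = np.zeros(m)                        # hidden biases
b_p = np.zeros(n)                      # input-layer biases are still separate

a = np.tanh(x @ W + b)
z = np.tanh(a @ W.T + b_p)             # tied weights: w' = w^T

n_params_tied = W.size + b.size + b_p.size
n_params_untied = 2 * W.size + b.size + b_p.size
```

The saving is exactly one copy of `W`; the constraint, not the reversal of the encoding, is the point.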

For the autoencoder to function it doesn't actually matter what activation function you use after the layer that you are pre-training, provided the last layer of the autoencoder can express the range of possible inputs. However, you may get varying quality of results depending on what you use, as normal for a neural network.

It is quite reasonable to use the same activation function that you are building the pre-trained layer for, as it is the simplest choice.

Using an inverse function is possible too, but not advisable for sigmoid or tanh, because e.g. $\text{arctanh}$ is only defined on the open interval $(-1, 1)$, so it would likely not be numerically stable.
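A quick NumPy illustration of that instability: $\tanh$ saturates, so in float64 arithmetic a moderately large pre-activation already rounds to exactly $1.0$, where $\text{arctanh}$ is undefined and returns infinity:

```python
import numpy as np

pre = np.array([1.0, 5.0, 20.0])  # pre-activations before tanh
a = np.tanh(pre)                  # tanh(20.0) rounds to exactly 1.0 in float64

with np.errstate(divide="ignore"):
    recovered = np.arctanh(a)     # last entry is inf: the inverse blows up
```

The first two entries are recovered fine, but any saturated unit destroys the "inverse" reconstruction entirely.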

Neil Slater

Posted 2016-01-12T23:39:55.800

Reputation: 24 613

Thanks! However, $w' = w^T$ seems to be common practice, as it is used, e.g., in the very basic tutorial for denoising autoencoders at deeplearning.net (http://deeplearning.net/tutorial/dA.html#daa). I do not find it so reasonable to use the same activation function for reconstruction; could you elaborate on this? It's true that it is the simplest choice, but it seems much more natural to me to use the $z'_i$ with $\text{arctanh}$, because this yields the actual mathematical inverse of the activation.

– Manfred Eppe – 2016-01-13T21:20:47.370

You can if you want. E.g. from http://deeplearning.net/tutorial/dA.html: "Optionally, the weight matrix $W'$ of the reverse mapping may be constrained to be the transpose of the forward mapping: $W' = W^T$. This is referred to as tied weights." (Emphasis mine.) The point of my answer is that if you do this, it is not in order to provide an automatic reversal of the encoding; it is just a constraint which will regularise the training.

– Neil Slater – 2016-01-13T21:26:59.860

Thanks Neil. Your comment about the $w' = w^T$ issue helped me to generalize my question and make it more precise, so I edited the question accordingly. In fact, I actually don't understand why it is useful to have separate $w'$ at all, instead of always using the transposed matrix $w^T$. The answer might be "because it gives better results", but then I am wondering why it gives better results. It looks unintuitive to me. – Manfred Eppe – 2016-01-14T19:59:10.610

@ManfredEppe: Perhaps instead you should be thinking carefully about why you think the transposed weight matrix and inverse function would be useful? There is no specific reason to use them - what exactly is your intuition behind thinking that they would be useful? If it is for "symmetry" then take another look at the order in which they are applied - it is not a symmetric reversal of the input-to-hidden layer (if it were, the inverse activation function should be first) – Neil Slater – 2016-01-14T20:40:27.087