Suppose you have an input layer with $n$ neurons and the first hidden layer has $m$ neurons, typically with $m < n$. Then you compute the activation $a_j$ of the $j$-th neuron in the hidden layer by

$a_j = f\left(\sum\limits_{i=1..n} w_{i,j} x_i+b_j\right)$, where $f$ is an activation function like $\tanh$ or $\text{sigmoid}$.

To train the network, you compute the reconstruction of the input, denoted $z$, and minimize the error between $z$ and $x$. Now, the $i$-th element in $z$ is typically computed as:

$ z_i = f\left ( \sum\limits_{j=1..m} w_{j,i}' a_j+b'_i \right) $
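
Under these definitions, the encode/decode pass can be sketched in a few lines of NumPy. The sizes, random initialisation, and the choice $f = \tanh$ are illustrative, not prescribed above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 4                     # illustrative sizes, m < n
x = rng.standard_normal(n)      # one input vector

f = np.tanh                     # activation function

# Encoder parameters: w maps input -> hidden, b is the hidden bias
w = 0.1 * rng.standard_normal((n, m))
b = np.zeros(m)

# Separate decoder parameters w', b' (the untied case)
w_prime = 0.1 * rng.standard_normal((m, n))
b_prime = np.zeros(n)

# a_j = f(sum_i w_ij x_i + b_j)
a = f(x @ w + b)

# z_i = f(sum_j w'_ji a_j + b'_i), same activation f
z = f(a @ w_prime + b_prime)

print(a.shape, z.shape)  # (4,) (8,)
```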

I am wondering why the reconstruction $z$ is usually computed with the same activation function instead of the inverse function, and why separate $w'$ and $b'$ are useful instead of tied weights and biases. It seems much more intuitive to me to compute the reconstruction with the inverse activation function $f^{-1}$, e.g., $\text{arctanh}$, as follows:

$$ z_i' = \sum\limits_{j=1..m} \frac{f^{-1}(a_j)-b_j}{w_{j,i}^T} $$

Note that here tied weights are used, i.e., $w' = w^T$, and the biases $b_j$ of the hidden layer are reused, instead of introducing an additional set of biases for the input layer.
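
For comparison, here is the proposed $z'_i$ implemented literally, again with illustrative random weights. $\text{arctanh}$ is well-defined here because $\tanh$ outputs lie in $(-1, 1)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 4
x = rng.standard_normal(n)

w = 0.5 * rng.standard_normal((n, m))   # encoder weights; tied decoder uses w^T
b = np.zeros(m)

# Encoder with f = tanh, as above
a = np.tanh(x @ w + b)

# z'_i = sum_j (arctanh(a_j) - b_j) / w_ij   (tied weights, inverse activation)
pre = np.arctanh(a) - b          # recovers sum_i w_ij x_i up to float error
z_prime = (pre[None, :] / w).sum(axis=1)

print(z_prime.shape)  # (8,)
```

As the comments in the sketch note, $\text{arctanh}$ exactly undoes the nonlinearity, but the element-wise division by $w_{i,j}$ does not invert the summation over $i$, so $z'$ need not equal $x$ even in the noise-free case.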

And a very related question: To visualize features, instead of computing the reconstruction, one would usually create an identity matrix with the dimension of the hidden layer. Then, one would use each column of the matrix as input to a reactivation function, which induces an output in the input neurons. For the reactivation function, would it be better to use the same activation function (resp. the $z_i$) or the inverse function (resp. the $z'_i$)?
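
A hypothetical sketch of this visualisation scheme, using the same-activation variant (the $z_i$) and each row of the identity matrix as a one-hot hidden vector fed through the decoder (decoder weights are random stand-ins for trained ones):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 4
w_prime = 0.1 * rng.standard_normal((m, n))   # stand-in for trained decoder weights
b_prime = np.zeros(n)

# Identity matrix: row j switches on only hidden unit j
H = np.eye(m)

# Decode each one-hot hidden vector; row j is the "feature" of hidden unit j
features = np.tanh(H @ w_prime + b_prime)

print(features.shape)  # (4, 8)
```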

Thanks! However, $w' = w^T$ seems to be common practice, as it is used, e.g., in the very basic tutorial for denoising autoencoders on deeplearning.net (http://deeplearning.net/tutorial/dA.html#daa). I do not find it so reasonable to use the same activation function for reconstruction; could you elaborate on this? It's true that it is the simplest choice, but it seems much more natural to me to use the $z'_i$ with $\text{arctanh}$, because this actually yields the mathematical inverse of the activation. – Manfred Eppe – 2016-01-13T21:20:47.370

You can if you want. E.g. from http://deeplearning.net/tutorial/dA.html: "Optionally, the weight matrix $W'$ of the reverse mapping may be constrained to be the transpose of the forward mapping: $W' = W^T$. This is referred to as *tied weights*." (Emphasis mine.) The point of my answer is that if you do this, it is not in order to provide automatic reversal of the encoding; it is just a constraint which will regularise the training. – Neil Slater – 2016-01-13T21:26:59.860

Thanks Neil. Your comment about the $w' = w^T$ issue helped me to generalize my question and make it more precise, so I edited the question accordingly. In fact, I don't understand why it is useful to have a separate $w'$ at all, instead of always using the transposed matrix $w^T$. The answer might be "because it gives better results", but then I am wondering *why* it gives better results. It looks unintuitive to me. – Manfred Eppe – 2016-01-14T19:59:10.610

@ManfredEppe: Perhaps instead you should be thinking carefully about why you think the transposed weight matrix and inverse function would be useful? There is no specific reason to use them - what exactly is your intuition behind thinking that they would be useful? If it is for "symmetry", then take another look at the order in which they are applied: it is not a symmetric reversal of the input-to-hidden layer (if it were, the inverse activation function would have to be applied first). – Neil Slater – 2016-01-14T20:40:27.087