
I've recently been looking at autoencoders and kernel PCA for unsupervised feature extraction.

Let's consider linear PCA for a moment. It's my understanding that if an autoencoder (with a single hidden layer) has linear activation functions, then the sum-of-squares error function (chosen for convenience rather than statistical relevance) will have a unique solution for the weights. This is easy to see. Consider the sum-of-squares error between the autoencoder input $\boldsymbol{x}$ and the autoencoder output $\hat{\boldsymbol{x}}$, which we wish to minimise:

$$ \min_{\boldsymbol{W},\boldsymbol{V}} {1\over 2N}\sum^N_{n=1} \|\boldsymbol{x}^{(n)} - \hat{\boldsymbol{x}}^{(n)}\|^2 $$

An autoencoder is a neural network that is trained to reproduce its input at its output. Let $\boldsymbol{W}$ be the weight matrix between the input layer and the hidden layer, and let $\boldsymbol{V}$ be the weight matrix between the hidden layer and the output layer. I'm ignoring the bias terms for now. Let $\psi$ be the activation function for each unit in the hidden layer and let $\phi$ be the activation function for each unit in the output layer. Then we have two mappings in an autoencoder:

$$ \boldsymbol{z} = \psi\left(\boldsymbol{W}\boldsymbol{x}\right), \:\:\:\:\:\:\: \hat{\boldsymbol{x}} = \phi\left(\boldsymbol{V}\boldsymbol{z}\right) $$

Substituting these mappings into the error function gives

$$ \min_{\boldsymbol{W},\boldsymbol{V}} {1\over 2N}\sum^N_{n=1} \|\boldsymbol{x}^{(n)} - \phi\left(\boldsymbol{V}\psi\left(\boldsymbol{W}\boldsymbol{x}^{(n)}\right)\right)\|^2 $$

If $\psi$ and $\phi$ are linear, i.e. $\phi(\boldsymbol{s})=\boldsymbol{s}$ and $\psi(\boldsymbol{s})=\boldsymbol{s}$, then $\hat{\boldsymbol{x}} = \boldsymbol{V}\boldsymbol{W} \boldsymbol{x} = \boldsymbol{A} \boldsymbol{x}$ (i.e. a linear map) and the problem becomes

$$ \begin{aligned} & \min_{\boldsymbol{W},\boldsymbol{V}} {1\over 2N}\sum^N_{n=1} \|\boldsymbol{x}^{(n)} - \boldsymbol{V}\boldsymbol{W}\boldsymbol{x}^{(n)}\|^2\\ = & \min_{\boldsymbol{A} = \boldsymbol{V}\boldsymbol{W}} {1\over 2N}\sum^N_{n=1} \|\boldsymbol{x}^{(n)} - \boldsymbol{A}\boldsymbol{x}^{(n)}\|^2 \end{aligned} $$

This is recognised as principal component analysis. This solution for the weights "forms a basis set which spans the [linear] principal component space" (Bishop, Pattern Recognition and Machine Learning). Bishop then goes on to say that these weight vectors aren't necessarily orthogonal or normalised. I assume that this is because we have not imposed any constraints on $\boldsymbol{A}$, as we would have done in PCA by requiring an orthonormal basis?

So am I right in thinking that the autoencoder I've described above does not in general give the linear PCA solution, but in some very special cases they may match?
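To make the point about non-orthogonal weights concrete, here is a minimal NumPy sketch. It assumes centred data and $M$ hidden units, and it constructs the weights directly from the PCA eigenvectors rather than training a network. The synthetic data, the mixing matrix `T`, and the helper `mse` are all illustrative choices of mine, not anything from Bishop. It shows two weight settings with the same reconstruction error, only one of which has orthonormal encoder rows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centred data: N samples, D input dimensions, M hidden units (illustrative).
N, D, M = 500, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X -= X.mean(axis=0)

# Linear PCA: top-M eigenvectors of the sample covariance matrix.
cov = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :M]              # D x M, orthonormal columns

# Weight setting 1: encoder W = U^T, decoder V = U (the PCA projection).
W_pca, V_pca = U.T, U

# Weight setting 2: mix the hidden units with an arbitrary invertible matrix T.
# The product V W = U U^T is unchanged, so the reconstruction is identical.
T = np.array([[2.0, 1.0], [0.5, 3.0]])
W_mix = T @ U.T
V_mix = U @ np.linalg.inv(T)

def mse(W, V):
    """Sum-of-squares error (1/2N) * sum ||x - V W x||^2 over the rows of X."""
    X_hat = X @ W.T @ V.T
    return np.sum((X - X_hat) ** 2) / (2 * N)

err_pca = mse(W_pca, V_pca)
err_mix = mse(W_mix, V_mix)

print("same error:", np.isclose(err_pca, err_mix))
print("W_pca rows orthonormal:", np.allclose(W_pca @ W_pca.T, np.eye(M)))
print("W_mix rows orthonormal:", np.allclose(W_mix @ W_mix.T, np.eye(M)))
```

Both settings span the same principal subspace, so in this sketch the product $\boldsymbol{V}\boldsymbol{W}$ is pinned down while the individual matrices $\boldsymbol{W}$ and $\boldsymbol{V}$ are not.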

Bishop then goes on to say that the only way to handle non-linear structure in the inputs is to add more hidden layers to the autoencoder and have some layers use non-linear activation functions. However, I would assume that the same problem as above persists, i.e. that one is not guaranteed to get the same solution as PCA?

There seem to be additional issues [Bishop]:

- Training requires solving a non-linear optimisation problem.
- One may end up with a solution that corresponds to a local minimum rather than the global one.
- One must specify the number of subspace dimensions as part of the network architecture.

Therefore my question is: why use autoencoders at all for unsupervised feature extraction? Why not just use kernel PCA, which will reduce to linear PCA when the data permits it? Is there some other advantage that I'm missing? Under what conditions should an autoencoder be chosen over kernel PCA?
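For reference, this is the sense in which I understand kernel PCA to contain linear PCA: a minimal NumPy sketch, assuming centred data and a linear kernel $k(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{x}^T\boldsymbol{y}$. The synthetic data and variable names are my own illustrative choices. With this kernel the projections of the training points should agree with the linear PCA scores up to the sign of each component; I would not expect this to happen automatically for a non-linear kernel such as an RBF.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, centred data: N samples, D dimensions, M components (illustrative).
N, D, M = 200, 4, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X -= X.mean(axis=0)

# Linear PCA: project onto the top-M eigenvectors of the covariance matrix.
cov = X.T @ X / N
_, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :M]
scores_pca = X @ U                       # N x M

# Kernel PCA with a linear kernel: eigendecompose the centred Gram matrix.
K = X @ X.T                              # N x N, K[i, j] = x_i . x_j
one = np.full((N, N), 1.0 / N)
Kc = K - one @ K - K @ one + one @ K @ one   # centring in feature space
lam, alpha = np.linalg.eigh(Kc)
lam = lam[::-1][:M]                      # top-M eigenvalues
alpha = alpha[:, ::-1][:, :M]            # matching unit-norm eigenvectors

# Projection of training point i onto component j is sqrt(lam_j) * alpha[i, j].
scores_kpca = alpha * np.sqrt(lam)

# Eigenvector signs are arbitrary, so compare magnitudes.
print(np.allclose(np.abs(scores_pca), np.abs(scores_kpca), atol=1e-6))
```

Note that the kernel version works with an $N \times N$ matrix rather than a $D \times D$ one, so its cost grows with the number of samples rather than the input dimension.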