## How can autoencoders be used for clustering?

11

Suppose I have a set of time-domain signals with absolutely no labels. I want to cluster them in 2 or 3 classes. Autoencoders are unsupervised networks that learn to compress the inputs. So given an input $x^{(i)}$, weights $W_1$ and $W_2$, biases $b_1$ and $b_2$, and output $\hat{x}^{(i)}$, we can find the following relationships:

$$z^{(i)} = W_1x^{(i)} + b_1$$

$$\hat{x}^{(i)} = W_2z^{(i)} + b_2$$

So $z^{(i)}$ would be a compressed form of $x^{(i)}$, and $\hat{x}^{(i)}$ the reconstruction of the latter. So far so good.
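To make the two equations concrete, here is a minimal NumPy sketch of a single forward pass. The sizes `N` and `d`, the random weights, and the variable names are illustrative assumptions, not values from any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 2  # assumed signal length and code size, with d < N

x = rng.normal(size=N)              # one input signal x^(i)
W1 = 0.1 * rng.normal(size=(d, N))  # encoder weights (untrained, random)
b1 = np.zeros(d)
W2 = 0.1 * rng.normal(size=(N, d))  # decoder weights (untrained, random)
b2 = np.zeros(N)

z = W1 @ x + b1      # compressed code z^(i), shape (d,)
x_hat = W2 @ z + b2  # reconstruction of x^(i), shape (N,)

# Training would adjust W1, b1, W2, b2 to make this error small.
reconstruction_error = np.mean((x - x_hat) ** 2)
```

Note that $z^{(i)}$ comes out as a vector of length `d` simply because $W_1$ has `d` rows.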

What I don't understand is how this could be used for clustering (if there is any way to do it at all). For example, in the first figure of this paper, there is a block diagram I'm not sure I understand. It uses the $z^{(i)}$ as the inputs to the feed-forward network, but there is no mention of how that network is trained. I don't know if there is something I'm missing or if the paper is incomplete. Also, this tutorial shows at the end the weights learned by the autoencoder, and they look like the kernels a CNN would learn to classify images. So I guess the autoencoder's weights can be used somehow in a feed-forward network for classification, but I'm not sure how.

My questions are:

1. If $x^{(i)}$ is a time-domain signal of length $N$ (i.e. $x^{(i)}\in\mathbb{R}^{1\times N}$), can $z^{(i)}$ only be a vector as well? In other words, would it make sense for $z^{(i)}$ to be a matrix with one of its dimensions greater than $1$? I believe it would not, but I just want to check.
2. Which of these quantities would be the input to a classifier? For example, if I want to use a classic MLP that has as many output units as there are classes, what should I feed to this fully-connected network ($z^{(i)}$, $\hat{x}^{(i)}$, or something else)?
3. How can I use the learned weights and biases in this MLP? Remember that we assumed that absolutely no labels are available, so it is impossible to train the network. I think the learned $W_i$ and $b_i$ should be useful somehow in the fully-connected network, but I don't see how to use them.

Observation: note that I used an MLP as an example because it is the most basic architecture, but the question applies to any other neural network that could be used to classify time-domain signals.

11

Clustering is difficult to do in high dimensions because the distances between most pairs of points become nearly equal. Using an autoencoder lets you re-represent high-dimensional points in a lower-dimensional space. It doesn't do clustering per se, but it is a useful preprocessing step for a secondary clustering step. You would map each input vector $x_i$ to a vector $z_i$ (not a matrix) with a smaller dimensionality, say 2 or 3. You'd then run some other clustering algorithm on all the $z_i$ values.
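The two-step idea (compress, then cluster the codes) can be sketched in plain NumPy as follows. Everything here is an assumption made for illustration: the synthetic sinusoid data, the linear autoencoder, the hyperparameters, and the small hand-written K-means. A real pipeline would more likely use a deep learning framework and a library clustering routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, unlabeled signals: two groups with different frequencies (assumed data).
N, n_per_group = 32, 100
t = np.arange(N) / N
group_a = np.sin(2 * np.pi * 2 * t) + 0.3 * rng.normal(size=(n_per_group, N))
group_b = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.normal(size=(n_per_group, N))
X = np.vstack([group_a, group_b])
n = X.shape[0]

# Linear autoencoder with a d-dimensional code, trained by full-batch gradient descent.
d, lr, epochs = 2, 0.01, 500
W1, b1 = 0.1 * rng.normal(size=(d, N)), np.zeros(d)
W2, b2 = 0.1 * rng.normal(size=(N, d)), np.zeros(N)

def encode(X):
    return X @ W1.T + b1

losses = []
for _ in range(epochs):
    Z = encode(X)
    E = Z @ W2.T + b2 - X                         # reconstruction error per sample
    losses.append(np.mean(np.sum(E ** 2, axis=1)))
    G = 2 * E / n                                 # gradient of the loss w.r.t. x_hat
    gZ = G @ W2                                   # backpropagate into the code
    W2 -= lr * (G.T @ Z)
    b2 -= lr * G.sum(axis=0)
    W1 -= lr * (gZ.T @ X)
    b1 -= lr * gZ.sum(axis=0)

def kmeans(points, k, n_iter=50):
    # Plain Lloyd's algorithm, initialized from k randomly chosen points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

codes = encode(X)                       # the z_i values, shape (200, 2)
labels, centroids = kmeans(codes, k=2)  # cluster in the low-dimensional space
```

The decoder is only needed during training; once the codes are computed, clustering operates on `codes` alone.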

Maybe someone else can chime in on using autoencoders for time series, because I have never done that. I suspect you would want one of the layers to be a 1D convolutional layer, but I am not sure.

Some people use autoencoders as a data pre-processing step for classification too. In this case, you would first use an autoencoder to calculate the $x$-to-$z$ mapping, and then throw away the $z$-to-$\hat{x}$ part and use the $x$-to-$z$ mapping as the first layer in the MLP.
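As a rough sketch of that reuse, the encoder parameters become the first layer of a classifier and a new output layer is stacked on top. The parameter values below are random stand-ins, and the names `W_out`, `b_out`, and `mlp_forward` are hypothetical. The new output layer starts out random and would still need labeled data to be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n_classes = 32, 2, 3  # assumed sizes

# Stand-ins for encoder parameters learned by an autoencoder (hypothetical values).
W1, b1 = rng.normal(size=(d, N)), np.zeros(d)

# New output layer: randomly initialized, requires labels to train.
W_out, b_out = 0.1 * rng.normal(size=(n_classes, d)), np.zeros(n_classes)

def mlp_forward(X):
    z = X @ W1.T + b1                                    # reused x-to-z mapping
    logits = z @ W_out.T + b_out                         # classification layer
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)      # softmax class probabilities

X = rng.normal(size=(5, N))  # five example signals
probs = mlp_forward(X)
```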

And in the last case, how would the weights of the other layers in the MLP be learned if the data is totally unlabeled? Or would that approach (i.e. autoencoder-MLP combination) only make sense if labels are available? – Tendero – 2017-12-15T20:21:42.217

Yes, an MLP (aka feed-forward neural network) is only really used if the data is labeled. Otherwise you have no info to use to update the weights. An autoencoder is sort of a 'trick' way to use neural networks because you are trying to predict the original input and don't need labels. – tom – 2017-12-15T20:27:07.390

So the only way to use a NN to do clustering would be the method you mentioned, right? Namely, use an autoencoder and then run a standard clustering algorithm such as K-means. – Tendero – 2017-12-15T20:28:52.203

That's the only way I know. If someone else has an idea I'd be happy to hear it. You might try other algorithms besides K-means though, since there are some pretty strict assumptions associated with that particular algorithm (but still it's a good thing to try first b/c it's fast and easy). – tom – 2017-12-15T21:30:19.680

1

## Before asking 'How can autoencoders be used to cluster data?' we must first ask 'Can autoencoders cluster data?'

An autoencoder learns to recreate the data points from the latent space. If we assume that the autoencoder maps to the latent space in a “continuous manner”, data points from the same cluster must be mapped close together. Hence, in a way, the encoder groups similar points “together”, that is, it clusters them. However, the literature shows that plain autoencoders do not reliably satisfy this assumption of continuity in the latent space.

Fortunately, variational autoencoders work exactly in this manner. They learn latent space mappings with two main properties: continuity and completeness [1].
• The continuity property ensures that two points close to each other in the latent space do not give two completely different outputs when decoded.
• The completeness property ensures that a point sampled from the latent space gives a meaningful output when decoded.

Therefore, the encoding produced by an autoencoder can sometimes be enough by itself. However, work has been done to improve on this by learning the clustering explicitly. The algorithm proposed by Xie et al. (2016) [2] is an example, which "iteratively refines clusters with an auxiliary target distribution derived from a current soft cluster assignment."
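To give a feel for what a "soft cluster assignment" and an "auxiliary target distribution" can look like, here is a small NumPy sketch. It follows the formulas as I recall them from Xie et al. (a Student's t kernel for the soft assignment, and squared, frequency-normalized assignments for the target), so treat it as an approximation of the idea rather than the paper's reference implementation. The function names and the toy data are my own.

```python
import numpy as np

def soft_assignments(Z, centroids, alpha=1.0):
    # q[i, j]: similarity of code z_i to centroid j under a Student's t kernel,
    # normalized so each row sums to 1.
    sq_dist = np.sum((Z[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # p[i, j] is proportional to q[i, j]^2 / f_j, where f_j is the soft cluster size.
    # Squaring emphasizes confident assignments.
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

# Toy latent codes and centroids (made-up values).
Z = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.9, 3.2]])
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])

q = soft_assignments(Z, centroids)
p = target_distribution(q)
```

In the iterative scheme the quote describes, the encoder and centroids would be updated to pull `q` toward `p` (for example by minimizing a KL divergence), after which `p` is recomputed.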

0

Suppose we have an n-dimensional dataset to be sorted into k clusters.

Consider a 3-layer network:

input layer: n neurons

hidden layer: k neurons, softmax activation

output layer: n neurons, linear activation

If we use our dataset as both x and y (this is a typical autoencoder, yes?), and we use a loss function that minimizes the Euclidean distance between the predicted and actual values, will we not end up with a layer that is trained to cluster?

I would think that the output of the second (hidden) layer would end up indicating the cluster that the data point belongs to. If we then removed the input layer, replaced it with a k-dimensional input, and passed a one-hot vector for each cluster to the predict function, I would expect to get the centroids. I would also think that the error at convergence during training could be used to plot a variance graph and find an 'elbow' to determine the optimal number of clusters.
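A minimal NumPy sketch of this architecture, with random untrained weights and assumed sizes, shows what the one-hot trick computes: decoding the j-th one-hot vector returns the j-th column of the decoder weights plus the bias. Whether those vectors behave like good centroids after training is the open question above, not something this sketch demonstrates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3  # assumed data dimension and number of clusters

W1, b1 = rng.normal(size=(k, n)), np.zeros(k)  # input -> hidden (untrained)
W2, b2 = rng.normal(size=(n, k)), np.zeros(n)  # hidden -> output (untrained)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def encode(X):
    return softmax(X @ W1.T + b1)  # soft cluster memberships, rows sum to 1

def decode(S):
    return S @ W2.T + b2  # linear output layer

X = rng.normal(size=(6, n))
S = encode(X)
X_hat = decode(S)               # each reconstruction is a weighted mix of the k decoded vectors
cluster_ids = S.argmax(axis=1)  # hard assignment taken from the soft memberships

# Feeding one-hot vectors to the decoder: row j equals W2[:, j] + b2.
centroids = decode(np.eye(k))
```

Because the hidden activations are soft rather than one-hot, each reconstruction is a convex combination of these k vectors, so the network is closer to a soft clustering than to K-means.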