what does an auto-encoder do?

The simplest auto-encoder takes a high dimensional image (say, 100K pixels) down to a low-dimensional representation (say, a vector of length 10) and then uses *only* those 10 features to try to reconstruct the original image. You can imagine an analogy with humans: I look at someone, describe them ("tall, dark-haired, ...") then after I've forgotten what they look like, I try to sketch them using only my notes.

what are we trying to learn?

In other words, why bother? A few reasons:

- dimensionality reduction: 10 features are a lot more convenient than 100K pixels. For example, I can perform classification by clustering in the 10-dimensional space (while clustering in the 100K-dimensional space would be intractable).
- semantic meaning: if all goes well, each of the 10 features will have some obvious "explanation" -- e.g., tweaking one value will make the subject look older (though it's normally not so simple). As opposed to pixel values, which are impacted by translation, rotation, etc.
- Exception recognition: if I train my auto-encoder on dogs, it should normally do a good job encoding and decoding pictures of dogs. But if I put a cat in, it will probably do a terrible job -- which I can tell because the output looks nothing like the input. So, looking for places where an auto-encoder does a bad job is a common way to look for anomalies.

I have seen examples where an image goes from a non smiling to a smiling face, or to a black and white image to a colored image.

There are many different types of auto-encoders. What I described above is the simplest kind. Another common type is a "denoising" auto-encoder -- instead of reconstructing the original image, the goal is to construct an image that is related to the original image, but different.

The classic example of this is denoising (hence the name): you can take a clean image, add a bunch of noise, run it through an auto-encoder, and then reward the auto-encoder for producing the *clean* image. So, the input (noisy image) is actually different from the desired output (clean image). The examples you give are similar.

The challenge in designing these types of auto-encoders is normally the loss -- you need some mechanism to tell the auto-encoder whether it did the right thing or not.

about the VAE, it uses a probabilistic approch, so we have to learn the mean and covariance of a gaussian.

A VAE is a third type of auto-encoder. It's a bit special because it is well-grounded mathematically; no *ad-hoc* metrics needed. The math is too complicated to go through here, but the key ideas are that:

- We want the latent space to be continuous. Rather than assigning each class to its own corner of the latent space, we want the latent space to have a well-defined, continuous shape (i.e., a Gaussian). This is nice because it forces the latent space to be semantically meaningful.
- The mapping between pictures and latent spaces should be probabilistic rather than deterministic. This is because the same subject can produce multiple images.

So, the workflow is this:

- You start with your image as before
- As before, your encoder determines a vector (say, length 200).
- But that vector is not a latent space. Instead, you use that vector as the parameters to define a latent space. For example, maybe you choose your latent space to be a 100-dimensional Gaussian. A 100-dimensional Gaussian will require a mean and a standard deviation in each dimension -- this is what you use your length-200 vector for.
- Now you have a probability distribution. You sample one point from this distribution. This is your image's representation in the latent space.
- As before, your decoder will turn this vector into a new "output" (say, a vector of length 200K).
- But, this "output" is not your output image. Instead, you use these 200K parameters to define a 100K-dimensional Gaussian. Then you sample one point from this distribution -- that's your output image.

Of course, there's nothing special about a Gaussian, you could just as easily use some other parametric distribution. In practice, people usually use Gaussians.

This sometimes gives better results than other auto-encoders. Further, you sometimes get interesting results when you look *between* the classes in your latent space. An image's distance in the latent space from the cluster center is sometimes related to uncertainty.

Moreover, there is the nice property that these high-dimensional Gaussians are *probability distributions* in a rigorous mathematical sense. They approximate the probability that a given image belongs to a given class. So, there is some thought that VAEs will be able to overcome the "hand waving" of deep learning and put everything back on a firm Bayesian probabilistic grounding. But of course, it is only an approximation, and the approximation involves a lot of deep neural networks, so there is still plenty of hand waving at the moment.

By the way, I like to use this question during interviews -- an astonishing number of people claim to have experience with VAEs but in fact do not realize that VAEs are different than "regular" AEs.

2This answer is top class and complete. Kudos! – Yahya – 2020-08-18T13:19:43.713