If I have a convolutional network that compresses 256x256x3 images down to, say, 100 continuous hidden states, then decodes back to 256x256x3, and I train it with stochastic gradient descent (where the target output is a denoised version of the input), is it reasonable to expect it to learn the same or similar features as a similarly architected network trained with variational methods (reconstruction + latent loss)? Assume both have the same number of conv/max-pool/norm steps between the input and the FC layers.

What if I have a really small dataset, say 10 examples, and want to overfit? Will both of them arrive at similar representations at the hidden layer?

I don't think so (but I'm not sure). The reason is that with VAEs you are forcing the latent variables toward a normal distribution, whereas "traditional" autoencoders do not have this restriction (though they may have others, such as sparsity) – BlackBear – 2017-09-05T11:54:06.110
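To make the difference concrete, here is a minimal numpy sketch of the two objectives (not the networks themselves): the plain denoising autoencoder minimizes only a reconstruction term, while the VAE adds a KL-divergence term pulling each latent's posterior (parameterized by a hypothetical `mu` and `logvar` per dimension) toward a standard normal. The 100-dim latent size matches the question; everything else is illustrative.

```python
import numpy as np

def ae_loss(x, x_hat):
    # Plain (denoising) autoencoder objective: reconstruction error only
    return np.mean((x - x_hat) ** 2)

def vae_loss(x, x_hat, mu, logvar):
    # VAE objective: reconstruction error plus a KL term that
    # regularizes the latent distribution toward N(0, I)
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon + kl

# Hypothetical 100-dim latent, as in the question
mu = np.zeros(100)       # posterior means
logvar = np.zeros(100)   # posterior log-variances
x = np.random.rand(16, 16, 3)

# With mu = 0 and logvar = 0, the KL term vanishes and the two
# objectives coincide; any latent code away from the prior is
# penalized by the VAE but not by the plain autoencoder.
print(ae_loss(x, x))                             # 0.0
print(vae_loss(x, x, mu, logvar))                # 0.0
print(vae_loss(x, x, mu + 1.0, logvar) > 0.0)    # True
```

This is why the latent geometries can differ even with identical encoders/decoders: the KL term reshapes where codes are allowed to sit, which matters especially when overfitting a tiny dataset, since nothing stops the plain autoencoder from scattering its 10 codes arbitrarily.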