What is the best architecture for an Auto-Encoder for image reconstruction?



I am trying to use a Convolutional Auto-Encoder for its latent space (embedding layer). Specifically, I want to use the embeddings for K-nearest-neighbor search in the latent space (an idea similar to word2vec).
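To make the intended use concrete, here is a minimal sketch of a brute-force nearest-neighbor lookup over latent vectors. The function name `knn`, the Euclidean metric, and the toy 2-D vectors are all assumptions made for illustration; real embeddings would come from the encoder's bottleneck and would be much higher-dimensional.

```python
import heapq
import math

def knn(query, embeddings, k=3):
    """Return indices of the k embeddings closest to `query` (Euclidean distance)."""
    # Pair each distance with its index so the index survives the selection.
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(embeddings)]
    return [idx for _, idx in heapq.nsmallest(k, distances)]

# Toy 2-D "latent vectors" (hypothetical stand-ins for encoder outputs).
latents = [(0.0, 0.0), (1.0, 1.0), (0.1, -0.1), (5.0, 5.0)]
nearest = knn((0.0, 0.1), latents, k=2)
print(nearest)
```

A linear scan like this is fine for a sanity check; for a dataset the size of ImageNet an approximate nearest-neighbor index would probably be needed.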

My input is 3x224x224 (ImageNet). I could not find any article that elaborates on a specific architecture (in terms of number of filters, number of conv layers, etc.), so I tried some arbitrary architectures like:


  • Conv(channels=3,filters=16,kernel=3)
  • Conv(channels=16,filters=32,kernel=3)
  • Conv(channels=32,filters=64,kernel=3)


  • Deconv(channels=64,filters=32,kernel=3)
  • Deconv(channels=32,filters=16,kernel=3)
  • Deconv(channels=16,filters=3,kernel=3)

But I'd like to start my hyper-parameter search from a setup that has proved itself on a similar task. Can you refer me to a source or suggest an architecture that worked for you for this purpose?
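One thing worth checking before any search is how large the latent representation of the layer list above actually is. The sketch below applies the standard convolution output-size formula. Stride and padding are not stated in the question, so stride 1 with no padding is an assumption, and the stride-2 variant is a hypothetical alternative shown only for comparison.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_shapes(size, filters, kernel=3, stride=1, padding=0):
    """Return the (channels, height, width) shape after each conv layer."""
    shapes = []
    for f in filters:
        size = conv_out(size, kernel, stride, padding)
        shapes.append((f, size, size))
    return shapes

def numel(shape):
    """Number of values in a (channels, height, width) tensor."""
    c, h, w = shape
    return c * h * w

input_values = 3 * 224 * 224  # values in one input image

# Assumption: stride 1, no padding (neither is given in the question).
plain = encoder_shapes(224, [16, 32, 64])
# Hypothetical variant: stride 2, padding 1, so each layer halves the resolution.
strided = encoder_shapes(224, [16, 32, 64], stride=2, padding=1)

print(input_values)
print(plain[-1], numel(plain[-1]))
print(strided[-1], numel(strided[-1]))
```

Under the stride-1 assumption the "bottleneck" holds about twenty times more values than the input, so it compresses nothing and is impractical for neighbor search. With stride-2 downsampling the latent drops to a third of the input size.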

Idan azuri

Posted 2019-04-22T09:25:25.467

Reputation: 133

There is none. You should always optimize your network through an "ad hoc" hyperparameter search that depends on the problem at hand. – pcko1 – 2019-04-22T16:04:05.207

@pcko1 I disagree. In many cases it is very helpful to start from an architecture used for a similar problem and then fine-tune it. Moreover, my dataset is ImageNet, which has been investigated extensively. Lastly, unless you have covered all the articles on arXiv, you can't say "there is none"... – Idan azuri – 2019-04-22T22:30:10.087

@Idanazuri: Did you find any good architecture for imagenet? – saurabheights – 2019-06-07T13:15:20.920

@saurabheights I haven't found any benchmark for the reconstruction task yet, so I used the DDCGAN architecture for the decoder and its mirror image for the encoder. It yields decent results. – Idan azuri – 2019-06-12T14:43:48.360

Thank you :). Will give it a try. If you have it open source, please let me know. – saurabheights – 2019-06-12T20:02:26.340



I don't know about an architecture being definitively the best, but there are some best practices you can follow. Check out these papers:

To sum it up, residual blocks in between downsampling, SSIM as a loss function, and larger feature map sizes in the bottleneck seem to improve reconstruction quality significantly. How that translates to the latent space is not entirely clear yet.
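To illustrate the SSIM suggestion, here is a rough sketch of the SSIM formula turned into a loss (`1 - SSIM`). It is a simplification: the usual SSIM averages the index over local (often Gaussian-weighted) windows, while this version treats the whole image as a single window and works on flat lists of pixel values. The function names and the toy pixel values are assumptions for illustration; in practice you would use a differentiable implementation from your deep-learning framework.

```python
from statistics import fmean

def global_ssim(x, y, data_range=1.0):
    """SSIM over two images given as flat lists, using one global window."""
    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def ssim_loss(x, y, data_range=1.0):
    """Loss that is 0 for identical images and grows as structure diverges."""
    return 1.0 - global_ssim(x, y, data_range)

# Toy "images" with pixel values in [0, 1] (hypothetical data).
target = [0.1, 0.5, 0.9, 0.3]
blurry = [0.4, 0.45, 0.5, 0.45]

print(ssim_loss(target, target))
print(ssim_loss(target, blurry))
```

A perfect reconstruction gives a loss of (numerically) zero, while the flattened-out reconstruction is penalized for losing contrast and structure even though its mean intensity is the same.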

Ilja Manakov

Posted 2019-04-22T09:25:25.467

Reputation: 26