I am trying to use a convolutional autoencoder for its latent space (embedding layer). Specifically, I want to use the embedding for k-nearest-neighbor search in the latent space (similar in spirit to word2vec).
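For concreteness, the lookup step I have in mind looks something like this (a minimal sketch with scikit-learn; the embeddings here are random placeholders standing in for the encoder outputs, and the 128-d latent size is just an example):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder: one 128-d latent vector per image, as the encoder would produce.
embeddings = np.random.rand(1000, 128).astype(np.float32)

# Build the index once, then query it with new latent vectors.
nn_index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)

query = embeddings[0:1]  # query with the first image's embedding
distances, indices = nn_index.kneighbors(query)
print(indices[0])  # indices of the 5 most similar images (the query itself comes first)
```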
My input is 3x224x224 (ImageNet images). I could not find any article that spells out a concrete architecture for this purpose (number of filters, number of conv layers, etc.), so I have tried a few arbitrary architectures.
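To illustrate what I mean by "arbitrary", here is a minimal PyTorch sketch of the kind of setup I have been experimenting with; the filter counts and the 128-d latent size are placeholders I picked by hand, not a tuned configuration:

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Toy conv autoencoder for 3x224x224 inputs; layer sizes are arbitrary."""
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: each stride-2 conv halves the spatial size: 224 -> 112 -> 56 -> 28 -> 14
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_latent = nn.Linear(256 * 14 * 14, latent_dim)
        self.from_latent = nn.Linear(latent_dim, 256 * 14 * 14)
        # Decoder mirrors the encoder with transposed convs: 14 -> 28 -> 56 -> 112 -> 224
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def embed(self, x):
        # The latent vector used for k-NN search.
        return self.to_latent(self.encoder(x).flatten(1))

    def forward(self, x):
        h = self.from_latent(self.embed(x)).view(-1, 256, 14, 14)
        return self.decoder(h)

model = ConvAutoencoder()
x = torch.randn(2, 3, 224, 224)
z = model.embed(x)   # (2, 128) embeddings for nearest-neighbor lookup
recon = model(x)     # (2, 3, 224, 224) reconstruction, trained with e.g. MSE loss
```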
But I'd like to start my hyper-parameter search from a setup that has proved itself on a similar task. Can you point me to a source, or suggest an architecture that worked for you for this purpose?