Basic encoder-decoder architecture



I read a couple of posts (like this one) about the encoder-decoder architecture and its implementation, but I still don't understand a few things.

  1. What is the difference between a basic CNN or RNN and an encoder-decoder? Are there properties that the encoder and decoder need to satisfy? As far as I understand, the encoder maps the input into a space of a different dimension and creates a context vector. This context vector is then decoded by the decoder.

  2. I know that there are different types of encoders and decoders. What is the simplest architecture to implement?

  3. Should I use predefined word embeddings like word2vec in the encoder (if I have text as input)?


Posted 2018-03-09T16:31:13.517

Reputation: 31



Encoder and decoder are highly overloaded terms. As a generic definition, an encoder-decoder neural architecture has one part of the network, the "encoder", that receives an input and generates a code (i.e. expresses the input in a different representation space), and another part, the "decoder", that takes a given code and converts it to the output representation space. Normally, the dimensionality of the code is much lower than that of the input/output representation spaces.
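As a rough illustration of that definition, here is a minimal NumPy sketch. The dimensions (784 and 32), the random untrained weights and the function names `encoder` and `decoder` are all assumptions made for the example; a real model would learn the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 784, 32  # input dimensionality and (much smaller) code dimensionality

# Untrained, randomly initialised weights; a real model would learn these.
W_enc = rng.normal(scale=0.01, size=(n, m))
W_dec = rng.normal(scale=0.01, size=(m, n))

def encoder(x):
    """Map inputs of shape (batch, n) to codes of shape (batch, m)."""
    return np.tanh(x @ W_enc)

def decoder(code):
    """Map codes of shape (batch, m) to the output space, shape (batch, n)."""
    return code @ W_dec

x = rng.random((4, n))   # a batch of 4 fake inputs
code = encoder(x)        # compact representation
output = decoder(code)   # back in the input/output space
print(code.shape, output.shape)
```

The only structural requirement shown here is that the decoder accepts whatever the encoder produces; everything else about the two parts is free to vary.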

In principle, the architecture of the encoder and the decoder is arbitrary, i.e. they can be CNNs, RNNs, multilayer perceptrons or anything else. The choice depends heavily on the task. This should answer question (1).

There are two contexts where an encoder-decoder network organization is common:

  • image autoencoders: normally, the encoder is a CNN and the decoder is the analogous "deconvolutional" network (transposed convolutions or upsampling followed by convolutions).
  • sequence-to-sequence tasks: normally, both encoder and decoder are GRUs or LSTMs.

As you mention word embeddings and a blog post about text summarization, I guess your problem relates to NLP, so you may start with the Keras blog seq2seq tutorial. This should answer question (2).
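To make the "context vector" idea concrete, here is a small NumPy sketch of the encoder half of a sequence-to-sequence model. It uses a plain tanh RNN cell instead of a GRU or LSTM to keep the code short, and the sizes, weights and the name `encode_sequence` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16

# Randomly initialised (untrained) parameters of a plain tanh RNN cell.
W_x = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def encode_sequence(seq):
    """Run the RNN over seq, shape (T, embed_dim); return the final hidden state."""
    h = np.zeros(hidden_dim)
    for x_t in seq:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

short_seq = rng.normal(size=(3, embed_dim))    # 3 time steps
long_seq = rng.normal(size=(11, embed_dim))    # 11 time steps

context_short = encode_sequence(short_seq)
context_long = encode_sequence(long_seq)
print(context_short.shape, context_long.shape)
```

Whatever the input length, the encoder returns a vector of the same size; a decoder RNN would then be initialised with that vector and generate the output sequence step by step.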

Whether pre-trained word embeddings are appropriate depends on the task and the amount of training data you have. When you don't have a lot of data, initializing the model with pre-trained embeddings is a form of transfer learning that can give better results. For translation, pre-trained word embeddings are used less often. This should answer question (3).
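One common way to use pre-trained vectors is to build an embedding matrix for your own vocabulary, copying the pre-trained vector where one exists and keeping a small random vector otherwise. The sketch below assumes a tiny made-up `pretrained` dictionary and vocabulary; in practice the vectors would be loaded from a word2vec or GloVe file.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 4

# Hypothetical pre-trained vectors (stand-ins for word2vec/GloVe entries).
pretrained = {
    "cat": np.array([0.1, 0.2, 0.3, 0.4]),
    "dog": np.array([0.2, 0.1, 0.4, 0.3]),
}

vocab = ["<pad>", "cat", "dog", "axolotl"]  # "axolotl" has no pre-trained vector
word_to_index = {word: i for i, word in enumerate(vocab)}

# Start from small random vectors, then overwrite the rows we have vectors for.
embedding_matrix = rng.normal(scale=0.1, size=(len(vocab), embed_dim))
embedding_matrix[word_to_index["<pad>"]] = 0.0
for word, idx in word_to_index.items():
    if word in pretrained:
        embedding_matrix[idx] = pretrained[word]

def embed(tokens):
    """Look up the embedding row of each token; returns shape (len(tokens), embed_dim)."""
    return embedding_matrix[[word_to_index[t] for t in tokens]]

vectors = embed(["cat", "axolotl"])
print(vectors.shape)
```

A matrix like this can be used as the initial weights of the encoder's embedding layer, either frozen or fine-tuned together with the rest of the network.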


Posted 2018-03-09T16:31:13.517

Reputation: 10 494


An autoencoder that uses convolutional layers is essentially a particular kind of CNN architecture.

The idea of an encoder is exactly as you stated: you want to go from a space $\mathbb{R}^n$ to $\mathbb{R}^m$ where $m<n$. There are many ways to compress information like this; such techniques are generally known as dimensionality reduction (compressed sensing is a related idea). The goal then became to find automated ways to generate this mapping from $n$ dimensions to $m$ dimensions, which is what an autoencoder does, as the name suggests. Lately a popular way of doing this for images is to use a CNN. For simplicity, let's just refer to this specific type as an autoencoder, but remember there are other ways of doing this.

The autoencoder has a special structure in which we constrain the number of units in the center (bottleneck) layer. The encoder is the part of the network that compresses the information into the $m$-dimensional space. We then use a decoder to reconstruct the input from the compressed data. An autoencoder is trained by using the same data as both the input and the target output: we want the network to recreate the input as closely as possible. If the output is equal to the input, we have perfect reconstruction and all the information contained in the input is retained in the code. When we constrain the dimensions there will generally be some information loss, and training tries to minimize this loss.
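The trade-off between the bottleneck size $m$ and the reconstruction loss can be seen without any neural network. The sketch below uses PCA (via a truncated SVD) as a stand-in linear encoder/decoder on synthetic data; the data, the sizes and the helper `reconstruction_error` are assumptions for illustration, not part of the model described here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples in R^10 that mostly live in a 3-dimensional subspace.
n, true_rank = 10, 3
latent = rng.normal(size=(200, true_rank))
mixing = rng.normal(size=(true_rank, n))
X = latent @ mixing + 0.05 * rng.normal(size=(200, n))
X = X - X.mean(axis=0)  # centre the data

# Rows of Vt are the principal directions, ordered by explained variance.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

def reconstruction_error(m):
    """Encode to m dimensions, decode back, and return the mean squared error."""
    basis = Vt[:m]        # shape (m, n)
    code = X @ basis.T    # encoder: R^n -> R^m
    X_hat = code @ basis  # decoder: R^m -> R^n
    return np.mean((X - X_hat) ** 2)

errors = {m: reconstruction_error(m) for m in (1, 3, 10)}
for m, err in errors.items():
    print(m, err)
```

The error shrinks as $m$ grows and only vanishes (up to floating-point error) when $m = n$, i.e. when there is no bottleneck at all.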

Here is an example of a simple convolutional autoencoder in Keras:

from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

input_img = Input(shape=(28, 28, 1))  # adapt this if using `channels_first` image data format

# Encoder: each pooling step halves the spatial size, 28 -> 14 -> 7 -> 4
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# at this point the representation is compressed to shape (4, 4, 8), i.e. 128 values

# Decoder: each upsampling step doubles the spatial size, 4 -> 8 -> 16 -> 14 -> 28
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)  # no padding: 16 -> 14, so the last upsampling gives 28
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

# The model maps an image to its reconstruction, so output and input shapes must match
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
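Because the target of an autoencoder is its own input, the decoder has to restore the input's spatial size exactly. A quick sanity check is to trace that size through the layers by hand. The helper `trace_size` and the layer list below are assumptions for illustration: three pooling stages mirrored by three upsampling stages, with one unpadded 3x3 convolution, applied to a 28x28 input.

```python
import math

def trace_size(size, layers):
    """Return the spatial size after each layer, starting from `size`."""
    sizes = [size]
    for kind in layers:
        if kind == "pool_same":       # 2x2 max pooling, padding='same'
            size = math.ceil(size / 2)
        elif kind == "upsample":      # 2x2 upsampling
            size = size * 2
        elif kind == "conv_valid":    # 3x3 convolution without padding
            size = size - 2
        elif kind == "conv_same":     # 3x3 convolution, padding='same'
            pass
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        sizes.append(size)
    return sizes

encoder_layers = ["conv_same", "pool_same", "conv_same", "pool_same",
                  "conv_same", "pool_same"]
decoder_layers = ["conv_same", "upsample", "conv_same", "upsample",
                  "conv_valid", "upsample", "conv_same"]

sizes = trace_size(28, encoder_layers + decoder_layers)
print(sizes)
```

If the last entry differs from the input size, the reconstruction cannot be compared with the input pixel by pixel, so a pooling, upsampling or padding choice needs to change.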

What happens if the network just learns the identity function?

The denoising autoencoder was conceived for this problem. Add white Gaussian noise to the input and train with the original, clean image as the target output. This forces the network to learn the underlying distribution of the images rather than just copying them to the output. You can use the same network as above. The noise can be added as follows:

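import numpy as np

# x_train_reshaped and x_test_reshaped are assumed to be image arrays
# with pixel values already scaled to [0, 1]
noise_factor = 0.5
x_train_noisy = x_train_reshaped + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train_reshaped.shape)
x_test_noisy = x_test_reshaped + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test_reshaped.shape)

# keep the noisy pixels inside the valid [0, 1] range
x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)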
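The key point of the denoising setup is how the training pairs are formed: the noisy array is the input and the clean array is the target. Here is a small self-contained sketch with random stand-in images; the function name `add_gaussian_noise`, the seed and the array sizes are assumptions for the example.

```python
import numpy as np

def add_gaussian_noise(images, noise_factor=0.5, seed=0):
    """Return a noisy copy of `images` (values in [0, 1]), clipped back to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = images + noise_factor * rng.normal(loc=0.0, scale=1.0, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

# Random stand-in for a batch of 8 grayscale 28x28 images scaled to [0, 1].
clean = np.random.default_rng(1).random((8, 28, 28, 1))
noisy = add_gaussian_noise(clean)

# Training pairs for a denoising autoencoder: input = noisy, target = clean.
inputs, targets = noisy, clean
print(inputs.shape, targets.shape)
```

Since the input and the target are no longer identical, copying the input straight to the output no longer minimizes the loss.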


Posted 2018-03-09T16:31:13.517

Reputation: 7 863

Minor question: why would the autoencoder be able to learn the identity function? Shouldn't that be the case only when the bottleneck layer has more features than the input? PS. Of course it should not be called a bottleneck then, but that's another issue :D. – saurabheights – 2019-06-07T13:14:54.233