How/What to initialize the hidden states in RNN sequence-to-sequence models?



In an RNN sequence-to-sequence model, the encoder's and the decoder's hidden states need to be initialized before training.

What values should we initialize them with? How should we initialize them?

The PyTorch tutorial simply initializes the hidden states to zeros.
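A minimal sketch of that zero-initialization (the function name and sizes here are illustrative, not copied from the tutorial):

```python
import torch

def init_hidden(hidden_size, batch_size=1):
    # all-zeros initial hidden state of shape (num_layers, batch, hidden)
    return torch.zeros(1, batch_size, hidden_size)
```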

Is initializing to zero the usual way of initializing hidden states in RNN seq2seq networks?

How about Glorot initialization?

For a single-layer vanilla RNN, wouldn't the fan-in and fan-out sum to $(1 + 1)$, which gives a variance of $1$, and wouldn't the Gaussian with $\text{mean}=0$ then give us a distribution of values centered at $0$?

for each input-hidden weight:
    variance = 2.0 / (fan_in + fan_out)
    stddev = sqrt(variance)
    weight = gaussian(mean=0.0, stddev=stddev)

For a single-layer encoder-decoder architecture with attention, if we use Glorot, we get a very small variance when initializing the decoder hidden state, since the fan-in would include the attention component, which is mapped to the entire vocabulary from the encoder output. So we end up with values that are all approximately $0$, since the stddev is really small.
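The shrinking-stddev concern above can be checked numerically (a sketch; the vocabulary and hidden sizes below are made-up numbers):

```python
import math

def glorot_stddev(fan_in, fan_out):
    # Glorot/Xavier: variance = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

# single-unit case: fan_in = fan_out = 1 -> stddev is 1.0
print(glorot_stddev(1, 1))

# fan-in dominated by a vocabulary-sized component
# (30000 vocabulary, 256 hidden units): stddev shrinks toward 0
print(glorot_stddev(30000, 256))
```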

What other initialization methods are there, esp. for the use on RNN seq2seq models?


Posted 2018-01-30T06:30:54.517

Reputation: 2 242



It is important to clear up the difference between hidden state initialization and weight initialization. Glorot (Xavier), Kaiming, etc. are all initialization methods for the weights of neural networks.

Since your question asks about hidden state initialization: hidden states can be initialized in a variety of ways, and initializing to zero is indeed common. Other methods include sampling from a Gaussian or another distribution. For RNNs, this defines what the network starts with as its 'memory'. Two common approaches are a noisy initialization (sampled from some distribution or a random number generator) and a learned initialization.
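The two non-learned options can be sketched in PyTorch as follows (the sizes and the 0.1 noise scale are arbitrary choices for illustration):

```python
import torch

num_layers, batch_size, hidden_size = 1, 2, 8

# zero initialization: the conventional default
h0_zeros = torch.zeros(num_layers, batch_size, hidden_size)

# noisy initialization: sample the starting 'memory' from a Gaussian
h0_noisy = 0.1 * torch.randn(num_layers, batch_size, hidden_size)
```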

To synthesize the link above: initializing hidden states with zeros can lead to the network learning to adapt away from a zero hidden state, rather than minimizing the loss over a long sequence (it follows that this is more of a problem for short sequences). If there are enough sequences, it can make sense to have the initial state be a trained variable that is updated via backpropagation.
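A learned initial state can be sketched in PyTorch as a trainable parameter expanded to the batch size (the module name and sizes here are hypothetical, not from the tutorial):

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        # trainable initial hidden state; gradients flow into it
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x):
        batch_size = x.size(0)
        # expand (1, 1, hidden) -> (1, batch, hidden) for the GRU
        h0 = self.h0.expand(-1, batch_size, -1).contiguous()
        return self.gru(x, h0)
```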

Mati K

Posted 2018-01-30T06:30:54.517

Reputation: 65