In an RNN sequence-to-sequence model, the encoder's and the decoder's hidden states need to be initialized before training.
What values should we initialize them with? How should we initialize them?
In the PyTorch tutorial, the hidden states are simply initialized to zeros.
Is zero initialization the usual way of initializing hidden states in RNN seq2seq networks?
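For concreteness, here is a minimal NumPy sketch of the zero initialization I mean (the `hidden_size` and batch shape are placeholders for illustration, not taken verbatim from the tutorial):

```python
import numpy as np

hidden_size = 256  # placeholder size, for illustration only

def init_hidden(batch_size=1):
    # Zero-initialize the hidden state: one hidden vector per sequence in the batch.
    return np.zeros((batch_size, hidden_size))

h0 = init_hidden()
print(h0.shape)  # (1, 256)
print(h0.any())  # False: every entry is exactly zero
```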
How about Glorot initialization?

For a single-layer vanilla RNN, wouldn't the fan-in and fan-out each be $1$, giving a variance of $2 / (1 + 1) = 1$, so that we sample from a Gaussian with mean $0$ and get values centered around $0$?
```
for each input-hidden weight:
    variance = 2.0 / (fan_in + fan_out)
    stddev   = sqrt(variance)
    weight   = gaussian(mean=0.0, stddev=stddev)
```
For a single-layer encoder-decoder architecture with attention, if we use Glorot initialization we get a very small variance when initializing the decoder hidden state, since the fan-in includes the attention component, which maps to the full vocabulary of the encoder output. As a result the samples from the Gaussian also end up $\approx 0$, because the stddev is tiny.
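To make the scale concrete, here is a quick check of the Glorot standard deviation for the two cases above (the fan sizes are made-up illustrative numbers, not from any particular model):

```python
import math

def glorot_stddev(fan_in, fan_out):
    # Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

# Single unit in, single unit out:
print(glorot_stddev(1, 1))  # 1.0

# A fan-in spanning a hypothetical 30k-word vocabulary (illustrative):
print(glorot_stddev(30000, 256))  # ~0.008, so samples cluster tightly around 0
```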
What other initialization methods are there, especially for use in RNN seq2seq models?