I am trying to understand the transformer architecture.
I am aware that the encoder/decoder contains multiple stacked self-attention layers. Furthermore, each layer contains multiple heads; for example, take 8 heads.
Now, for a particular layer we will have 8 different sets of (Wq, Wk, Wv), the weight matrices used to compute the query, key, and value.
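For concreteness, here is a small NumPy sketch of the setup I am describing (the dimensions are just illustrative, not prescribed by the architecture), with one independent (Wq, Wk, Wv) triple per head per layer:

```python
import numpy as np

# Illustrative dimensions: model width 512, 8 heads, 6 layers
d_model, n_heads, n_layers = 512, 8, 6
d_head = d_model // n_heads  # 64 per head

rng = np.random.default_rng(0)

# One independent (Wq, Wk, Wv) triple per head per layer,
# i.e. nothing is shared across layers in this sketch
weights = [
    [
        {name: rng.standard_normal((d_model, d_head))
         for name in ("Wq", "Wk", "Wv")}
        for _ in range(n_heads)
    ]
    for _ in range(n_layers)
]

# Under this setup, head #1 of layer 1 and head #1 of layer 2
# hold distinct parameter arrays:
shared = np.array_equal(weights[0][0]["Wq"], weights[1][0]["Wq"])
print(shared)  # False
```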
What I want to know is whether these weight matrices are shared between the different layers, i.e., are the (Wq, Wk, Wv) matrices of head #1 in layer 1 the same as those of head #1 in layers 2, 3, ...?
And if they are shared, doesn't that affect parallelization?