Weight matrices in transformers


I am trying to understand the transformer architecture.

I am aware that the encoder/decoder contains multiple stacked self-attention layers. Further, each layer contains multiple heads; take 8 heads as an example.

For a particular layer, we will then have 8 different sets of (Wq, Wk, Wv), the weight matrices used to calculate the queries, keys and values.

What I want to know is whether these weight matrices are shared between the different layers, i.e. are the (Wq, Wk, Wv) matrices of head #1 in layer 1 the same as those of head #1 in layers 2, 3, ...?

And if they are shared, doesn't that affect parallelization?


Posted 2019-12-05T10:34:50.910


The heads are not shared among layers. – Astariul – 2019-12-09T02:24:36.350
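To illustrate the comment above: each layer owns its own independent per-head projection matrices, so head #1 of layer 1 and head #1 of layer 2 are separate parameters. A minimal NumPy sketch (the dimensions `d_model = 512`, `n_heads = 8`, `n_layers = 6` are the defaults from the original transformer paper and are assumptions here, as is the nested-list layout):

```python
import numpy as np

d_model, n_heads, n_layers = 512, 8, 6
d_head = d_model // n_heads  # each head projects into a 64-dim subspace

rng = np.random.default_rng(0)

# weights[layer][head] = (Wq, Wk, Wv), each of shape (d_model, d_head).
# Every layer gets its OWN set of matrices -- nothing is shared across layers.
weights = [
    [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
     for _ in range(n_heads)]
    for _ in range(n_layers)
]

# Wq of head #1 in layer 1 vs. Wq of head #1 in layer 2: distinct parameters.
Wq_l1_h1 = weights[0][0][0]
Wq_l2_h1 = weights[1][0][0]
assert Wq_l1_h1 is not Wq_l2_h1
print(np.allclose(Wq_l1_h1, Wq_l2_h1))  # False: independent values

# Total attention-projection parameters across the stack:
total = n_layers * n_heads * 3 * d_model * d_head
print(total)
```

Note that sharing weights across layers would not hinder parallelization within a layer anyway, since the layers themselves must run sequentially; the usual reason not to share is modeling capacity.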

No answers