Why do transformers require a large d_model even when the input cardinality is low?


I'm training a transformer encoder for an NLP task over character data, so the cardinality of my input is 26. I've noticed that if I want to create a strong model, I need to make $x$ == my embedding dimension == model dimension == output dimension larger, because the number of trainable parameters in my model heavily depends on $x$. Does it make sense to represent characters with a large number of dimensions, e.g. $x=300$? With an LSTM the number of parameters depends on the hidden size $h$, so I didn't have to do that...

Transformer encoder, single layer:

  • attention: $4\times[x\times x+x]$
  • norm: $2\times[x+x]$
  • linear: $2\times[x\times h]+x+h$
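A minimal sketch of the count above, as a hypothetical helper (the choice $h = 4x$ is just the common convention for the feed-forward width, not something fixed by the architecture):

```python
def transformer_layer_params(x, h):
    """Trainable parameters of one transformer encoder layer
    with model dimension x and feed-forward dimension h."""
    attention = 4 * (x * x + x)    # Q, K, V and output projections, each with bias
    norm = 2 * (x + x)             # two LayerNorms, each with gain and bias of size x
    linear = 2 * (x * h) + x + h   # two FFN projections (x->h and h->x) with biases
    return attention + norm + linear

# Every term scales with x, so the layer grows roughly quadratically in x:
print(transformer_layer_params(300, 4 * 300))   # about 1.1M parameters
print(transformer_layer_params(600, 4 * 600))   # roughly 4x that
```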

LSTM, single layer:

  • $4\times[(x+h)\times h+h]$
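For comparison, the LSTM count under the same notation; the point is that the input size $x$ only enters linearly through the $x\times h$ term, so a small character embedding can be paired with a large hidden size:

```python
def lstm_layer_params(x, h):
    """Trainable parameters of one LSTM layer: four gates,
    each a (x + h) -> h projection with bias."""
    return 4 * ((x + h) * h + h)

# With characters, x can stay tiny while capacity lives in h:
print(lstm_layer_params(26, 512))   # most parameters come from the h*h terms
```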


Posted 2021-02-27T11:37:46.763


No answers