I'm training a transformer encoder for an NLP task over character data, so the cardinality of my input is 26. I've noticed that if I want to create a strong model, I need to make $x$ (== my embedding dimension == model dimension == output dimension) larger, because the number of trainable parameters in my model heavily depends on $x$. Does it make sense to represent characters with a large number of dimensions, e.g. $x=300$? In an LSTM, the number of parameters depends on the hidden size $h$, so I didn't have to do that...
Transformer encoder, single layer:
- attention: $4\times[x\times x+x]$
- norm: $2\times[x+x]$
- linear (feed-forward): $2\times[x\times h]+x+h$, where $h$ here is the feed-forward width
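For reference, this is how I sanity-check the per-layer transformer count against PyTorch's `nn.TransformerEncoderLayer`; the values $x=300$, $h=1200$, and `nhead=6` are only illustrative:

```python
import torch.nn as nn

x, h = 300, 1200  # model dim and feed-forward width (h = 4x is a common choice, arbitrary here)
layer = nn.TransformerEncoderLayer(d_model=x, nhead=6, dim_feedforward=h)

pytorch_count = sum(p.numel() for p in layer.parameters())
# attention + two layer norms + feed-forward, as in the breakdown above
formula_count = 4 * (x * x + x) + 2 * (x + x) + 2 * (x * h) + x + h
print(pytorch_count, formula_count)  # the two numbers should match
```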
LSTM, single layer:
- $4\times[(x+h)\times h+h]$
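And the equivalent check for the LSTM count (the sizes are again just an example; note that PyTorch's `nn.LSTM` keeps separate input and hidden bias vectors per gate, so it reports $4h$ more parameters than the formula above):

```python
import torch.nn as nn

x, h = 26, 512  # illustrative: one-hot characters as input, hidden size 512
lstm = nn.LSTM(input_size=x, hidden_size=h, num_layers=1)

pytorch_count = sum(p.numel() for p in lstm.parameters())
formula_count = 4 * ((x + h) * h + h)
# add 4*h to account for PyTorch's duplicated per-gate biases
print(pytorch_count, formula_count + 4 * h)  # the two numbers should match
```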