## Do transformers require a large d_model even when the input cardinality is low?

I'm training a transformer encoder for an NLP task over character data, so the cardinality of my input is 26. I've noticed that if I want to build a strong model, I need to make $$x$$ (my embedding dimension == model dimension == output dimension) larger, because the number of trainable parameters in my model depends heavily on $$x$$. Does it make sense to represent characters with a large number of dimensions, e.g. $$x=300$$? In an LSTM the number of parameters depends on the hidden size $$h$$, so I didn't have to do that...

Transformer encoder, single layer:

• attention: $$4\times[x\times x+x]$$
• norm: $$2\times[x+x]$$
• feed-forward (inner size $$h$$): $$2\times[x\times h]+x+h$$
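
For concreteness, a minimal sketch (assuming PyTorch's `nn.TransformerEncoderLayer`, with hypothetical sizes $$x=300$$ and feed-forward width $$h=1200$$) that reproduces this count:

```python
# Minimal parameter-count check, assuming PyTorch's nn.TransformerEncoderLayer.
# x = d_model (embedding/model/output dimension), h = feed-forward inner size.
import torch.nn as nn

x, h = 300, 1200  # hypothetical sizes for illustration
layer = nn.TransformerEncoderLayer(d_model=x, nhead=4, dim_feedforward=h)

total = sum(p.numel() for p in layer.parameters())
expected = 4 * (x * x + x) + 2 * (x + x) + 2 * (x * h) + x + h
print(total, expected)  # both 1083900: attention + 2 layer norms + feed-forward
```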

LSTM, single layer:

• $$4\times[(x+h)\times h+h]$$
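
And the same kind of check for the LSTM (assuming PyTorch's `nn.LSTM`; note that PyTorch keeps two bias vectors per gate, so it reports an extra $$4h$$ on top of the textbook count):

```python
# Parameter-count check for a single-layer LSTM, assuming PyTorch's nn.LSTM.
# x = input size (e.g. a 26-dim character embedding), h = hidden size.
import torch.nn as nn

x, h = 26, 300  # hypothetical sizes for illustration
lstm = nn.LSTM(input_size=x, hidden_size=h, num_layers=1)

total = sum(p.numel() for p in lstm.parameters())
textbook = 4 * ((x + h) * h + h)
print(total, textbook, total - textbook)  # difference is 4*h (second bias vector)
```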