Transformer-based architectures for regression tasks


As far as I've seen, transformer-based architectures are always trained with classification tasks (one-hot text tokens for example). Are you aware of any architectures using attention and solving regression tasks? Could one build a regressive auto-encoder for example? How would normalization fit into this (as LayerNorm destroys some of the information from the input)?

Damjan Dakic

Posted 2020-05-26T18:03:35.377

Reputation: 141



In the simplest case, doing regression with Transformers is just a matter of changing the loss function.

BERT-like models use the representation of the first technical token (e.g. [CLS] in BERT) as the input to the classifier. You can replace the classifier with a regressor and pretty much nothing else changes. The error from the regressor gets propagated to the rest of the network, so you can train the regressor and fine-tune/train the underlying Transformer at the same time.
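To make this concrete, here is a minimal sketch (a generic PyTorch encoder, not any specific pretrained model): the first-token representation feeds a linear regressor, and the only real change from the classification setup is using MSE instead of cross-entropy.

```python
import torch
import torch.nn as nn

class TransformerRegressor(nn.Module):
    """Toy Transformer encoder with a regression head on the first token."""
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.regressor = nn.Linear(d_model, 1)  # replaces the classifier head

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        # First ("technical") token's representation -> scalar prediction
        return self.regressor(h[:, 0]).squeeze(-1)

model = TransformerRegressor()
tokens = torch.randint(0, 1000, (8, 16))  # batch of 8 sequences, length 16
targets = torch.randn(8)                  # continuous regression targets
loss = nn.MSELoss()(model(tokens), targets)
loss.backward()  # the regression error propagates into the encoder as well
```

After `backward()`, the embedding and encoder parameters all carry gradients, so the whole stack is trained jointly with the regressor, exactly as in the classification case.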

Also, I don't think that layer normalization causes severe information loss. It is already in place while the network is trained, so the rest of the network's parameters learn to compensate for it, which should not be a problem because the gradients "know very well" that there was a normalization layer.


Posted 2020-05-26T18:03:35.377

Reputation: 888

Regarding the information loss, I was referring more to the autoencoder scenario. Because of LayerNorm, the mean and variance of each input sequence are completely destroyed. With that in mind, how could an autoencoder ever learn to replicate its input? The gradients "know very well" that there was a normalization layer in the sense of the learned affine transform, but a portion of the information is truly lost. – Damjan Dakic – 2020-06-25T09:50:15.983