In the simplest case, doing regression with Transformers is just a matter of changing the loss function.
BERT-like models use the representation of the first special token ([CLS] in BERT) as the input to the classifier. You can replace the classifier with a regressor, for example a linear layer with a single output trained with a mean squared error loss, and pretty much nothing else will change. The error from the regressor gets propagated back to the rest of the network, so you can both train the regressor and fine-tune/train the underlying Transformer.
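To make the gradient flow concrete, here is a minimal sketch in plain Python, with no deep-learning library. All names (`regression_head`, `mse_loss_and_grads`) and the numbers are made up for illustration, and the three-dimensional vector `h` just stands in for the first-token representation a real model would produce. In practice an autograd framework would compute these gradients for you.

```python
def regression_head(h, w, b):
    """Linear regressor on top of the first-token representation h."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def mse_loss_and_grads(h, w, b, target):
    """Squared error for a single example, plus its gradients."""
    pred = regression_head(h, w, b)
    err = pred - target
    loss = err ** 2
    grad_w = [2 * err * hi for hi in h]  # used to update the regressor weights
    grad_b = 2 * err                     # used to update the regressor bias
    grad_h = [2 * err * wi for wi in w]  # passed back into the Transformer
    return loss, grad_w, grad_b, grad_h


# Toy values (assumptions): a stand-in first-token vector and a small head.
h = [0.5, -1.0, 2.0]
w = [0.1, 0.2, 0.3]
b = 0.0

loss, grad_w, grad_b, grad_h = mse_loss_and_grads(h, w, b, target=1.0)
print(loss, grad_h)
```

`grad_h` is the part that matters for fine-tuning: it is the error signal that the encoder receives, exactly as it would from a classification head, only computed from a different loss.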
Also, I don't think that layer normalization causes severe information loss. It is already there when the network is trained, so the rest of the network's parameters adapt to it. That should not be a problem, because the gradients are computed through the normalization layer and so "know very well" that it is there.
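As a rough illustration of what layer normalization does discard, here is a small sketch in plain Python. It assumes the usual definition (subtract the mean and divide by the standard deviation over the feature dimension) and leaves out the learnable gain and bias for brevity. The vectors are made up.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


a = [1.0, 2.0, 4.0]
b = [3 * xi + 10 for xi in a]  # same pattern as a, different scale and offset
c = [4.0, 2.0, 1.0]            # different pattern

na, nb, nc = layer_norm(a), layer_norm(b), layer_norm(c)
print(na)
print(nb)  # almost identical to na
print(nc)  # clearly different from na
```

Only the overall offset and scale of each vector are removed. The relative pattern across features is kept, and that is what the following layers are trained to work with.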