No, the activation of the output layer should instead be tailored to the labels you're trying to predict. The network's prediction can be seen as a distribution over the labels, for example a categorical distribution for classification or a Gaussian (or something more flexible) for regression. The output of your network should then predict the sufficient statistics of that distribution. For example, a softmax activation on the last layer ensures that the outputs are positive and sum to one, as you would expect for a categorical distribution. When you predict a Gaussian with mean and variance, the mean needs no activation, but the variance has to be positive, so you could use exp (or a similar positive-valued function such as softplus) as the activation for that part of the output.
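
As a minimal NumPy sketch of the two cases above (the function names and the two-element Gaussian head layout are my own illustration, not from any particular framework):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; result is positive and sums to one.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_head(raw):
    # raw has shape (..., 2): raw mean and raw (unconstrained) log-variance.
    mean = raw[..., 0]         # the mean needs no activation
    var = np.exp(raw[..., 1])  # exp keeps the variance strictly positive
    return mean, var

probs = softmax(np.array([2.0, 0.5, -1.0]))   # valid categorical probabilities
mean, var = gaussian_head(np.array([0.3, -2.0]))
```

The same idea carries over to any output distribution: pick an activation that maps the unconstrained network outputs into the valid parameter space of the distribution.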

Okay, and what about dropout/batch-norm? Should they be after the last fully connected layer? – Gilad Deutsch – 2020-05-01T13:05:45.543