Should batch-normalization/dropout/activation-function layers be used after the last fully connected layer?


I am using the following architecture:

3*(fully connected -> batch normalization -> relu -> dropout) -> fully connected

Should I add the batch normalization -> relu -> dropout part after the last fully connected layer as well? (The output is positive anyway, so the relu wouldn't hurt, I suppose.)
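For concreteness, here is a minimal NumPy sketch of the forward pass described above (the hidden widths, dropout rate, and batch-norm statistics are arbitrary placeholders of mine; batch norm is shown in its inference form, using fixed running statistics):

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, W, b, gamma, beta, mean, var, p_drop, train=True, eps=1e-5):
    """One (fully connected -> batch norm -> relu -> dropout) block."""
    x = x @ W + b                                        # fully connected
    x = gamma * (x - mean) / np.sqrt(var + eps) + beta   # batch norm (inference form)
    x = np.maximum(x, 0.0)                               # relu
    if train:                                            # inverted dropout
        mask = rng.random(x.shape) >= p_drop
        x = x * mask / (1.0 - p_drop)
    return x

def forward(x, params, p_drop=0.5, train=True):
    for W, b, gamma, beta, mean, var in params[:-1]:
        x = block(x, W, b, gamma, beta, mean, var, p_drop, train)
    W, b = params[-1]   # last fully connected layer: no BN / relu / dropout
    return x @ W + b

# toy parameters: 3 hidden blocks of width 8, input dim 16, output dim 4 (all arbitrary)
dims = [16, 8, 8, 8]
params = [(rng.normal(size=(i, o)) * 0.1, np.zeros(o),
           np.ones(o), np.zeros(o), np.zeros(o), np.ones(o))
          for i, o in zip(dims[:-1], dims[1:])]
params.append((rng.normal(size=(8, 4)) * 0.1, np.zeros(4)))

out = forward(rng.normal(size=(2, 16)), params, train=False)
print(out.shape)  # (2, 4)
```

Note the last fully connected layer is applied bare here, which is the question: whether anything should follow it.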

Gilad Deutsch

Posted 2020-04-27T11:02:50.867

Reputation: 509



No, the activation of the output layer should instead be tailored to the labels you're trying to predict. The network prediction can be seen as a distribution, for example a categorical for classification or a Gaussian (or something more flexible) for regression. The output of your network should predict the sufficient statistics of this distribution. For example, a softmax activation on the last layer ensures that the outputs are positive and sum up to one, as you would expect for a categorical distribution. When you predict a Gaussian with mean and variance, you don't need an activation for the mean but the variance has to be positive, so you could use exp as activation for that part of the output.
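To make the last point concrete, here is a small NumPy sketch of the two output heads described above (the function names and the convention of predicting log-variance are my own choices):

```python
import numpy as np

def softmax(z):
    """Positive outputs that sum to one -- a valid categorical distribution."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_head(z):
    """Split raw outputs into (mean, variance); exp keeps the variance positive."""
    mean, log_var = np.split(z, 2, axis=-1)
    return mean, np.exp(log_var)   # no activation on the mean

logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)
print(np.isclose(probs.sum(), 1.0))   # True

mean, var = gaussian_head(np.array([0.3, -4.0]))
print(bool((var > 0).all()))          # True
```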


Posted 2020-04-27T11:02:50.867

Reputation: 151

Okay, and what about dropout/batch-norm? Should they be after the last fully connected layer? – Gilad Deutsch – 2020-05-01T13:05:45.543