## How to make sure that the features learned by a neural network are not correlated?

Each layer of a neural network learns features of the input data. The first layer learns low-level features (e.g. edges in images). Each subsequent layer learns more abstract features. Then the features from the last hidden layer are combined for classification or regression.

How to make sure that the features learned by a neural network are not correlated?

I think that I can force a neural network to learn correlated features in the following way. Remove the output layer, so that the last hidden layer produces $n$ outputs. Then train this network on $n$ target variables that are not independent but are linear combinations of, say, two or three independent targets (perhaps with some noise added to avoid exact linear dependence). Since neural networks are universal approximators, it should be possible to train such a network. In that case, however, all neurons in the final layer will be redundant except for the two or three that correspond to the linearly independent targets.

This redundancy can be present in each layer. How to avoid it? Of course, I can check the correlation of neuron outputs after the training, remove the correlated neurons, and retrain the network but it looks inefficient. How to make the network learn uncorrelated features from the beginning?
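The post-hoc check mentioned above is at least easy to automate. A minimal NumPy sketch (`redundant_features` is a hypothetical helper name, not a library function) that flags pairs of neuron outputs whose absolute Pearson correlation is high:

```python
import numpy as np

def redundant_features(H, threshold=0.95):
    """Post-hoc check: return index pairs of features in H
    (shape: batch x features) whose absolute Pearson correlation
    exceeds `threshold`."""
    C = np.corrcoef(H, rowvar=False)
    n = C.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(C[i, j]) > threshold]

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
# Column 1 is (almost) a rescaled copy of column 0; column 2 is independent.
H = np.hstack([x, 3.0 * x + 0.01 * rng.normal(size=(1000, 1)),
               rng.normal(size=(1000, 1))])
pairs = redundant_features(H)
```

Here `pairs` contains only the near-duplicate pair `(0, 1)`, so one of those two neurons could be pruned before retraining.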

What comes to my mind is using different and independent (e.g. orthogonal) activation functions for each neuron within the layer. Are there such networks? Usually, activation functions are taken to be the same.

What is your goal in doing away with correlated features? – Dave – 2021-02-17T15:03:37.363

Dropout (Srivastava et al., 2014) is my first thought.

By effectively removing N% of your neurons on each forward pass, you make it harder for any two neurons to co-adapt. When a neuron's buddy disappears, it is forced to find another way to learn the patterns in the data. On the next epoch, when the old buddy comes back, it has a new perspective on life, its horizons have been broadened, and it is less likely to re-enter that co-dependent relationship.
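A minimal sketch of what dropout does to a single layer's activations (this is the "inverted" variant used by most frameworks; `dropout` here is a hand-rolled illustration, not a library call):

```python
import numpy as np

def dropout(activations, rate, training=True, rng=None):
    """Inverted dropout: randomly zero a fraction `rate` of units and
    rescale the survivors by 1/(1 - rate), so the expected activation
    is unchanged and nothing special is needed at inference time."""
    if not training or rate == 0.0:
        return activations
    rng = rng if rng is not None else np.random.default_rng()
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep  # True = neuron survives
    return activations * mask / keep

h = np.ones((4, 10))
out = dropout(h, rate=0.5, rng=np.random.default_rng(0))
# Roughly half the units are zeroed; the survivors are rescaled to 2.0.
```

Because the surviving partners change every pass, no neuron can count on any particular "buddy" being present.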

My second thought was L2 regularization: it discourages the network from relying on just one neuron in a layer and pushes it to get them all involved.
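To see the mechanism in isolation, here is a toy sketch (`sgd_step_with_l2` is made up for illustration): with the data gradient held at zero, the L2 term's own gradient `lam * W` steadily shrinks a dominant weight, which is why large single-neuron solutions are penalized relative to spreading the load.

```python
import numpy as np

def sgd_step_with_l2(W, grad, lr=0.1, lam=0.05):
    """One SGD step where the loss includes an L2 penalty
    0.5 * lam * ||W||^2; its gradient lam * W shrinks every
    weight toward zero on every step."""
    return W - lr * (grad + lam * W)

W = np.array([10.0, 0.0])  # one dominant weight, one silent weight
for _ in range(100):
    # Zero data gradient isolates the effect of the penalty alone.
    W = sgd_step_with_l2(W, grad=np.zeros_like(W))
# The large weight has decayed substantially; the zero weight stays zero.
```

In a real training run the data gradient pushes back, and the equilibrium trades off fit against weight magnitude.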

If you are not convinced, just try it. Try to create a network where, say, only two of the neurons in the final layer carry all the signal. Then re-train it using a dropout of 50%, or a higher L2 value. It may be a weaker model overall, but it is going to be really hard to not have all the neurons share the load. (Also try it with and without the bias, as mentioned below.)

> different and independent activation functions for each neuron within the layer. Are there such networks?

I'd be interested to hear if you find anything. I suppose you could argue that the bias is creating a unique activation function on each neuron, though I think you are suggesting something like ReLU in some neurons and tanh in others? (A bit tricky to implement efficiently?)
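For what it's worth, mixed per-neuron activations need not require a per-neuron loop: boolean masks keep the whole thing vectorized. A hypothetical layer nonlinearity (`mixed_activation` is an invented name, not an existing layer type) that applies ReLU to even-indexed units and tanh to odd-indexed ones:

```python
import numpy as np

def mixed_activation(z):
    """Apply ReLU to even-indexed units and tanh to odd-indexed units,
    using boolean masks so the operation stays fully vectorized."""
    out = np.empty_like(z)
    even = np.arange(z.shape[-1]) % 2 == 0
    out[..., even] = np.maximum(z[..., even], 0.0)  # ReLU half
    out[..., ~even] = np.tanh(z[..., ~even])        # tanh half
    return out

z = np.array([[-1.0, -1.0, 2.0, 2.0]])
out = mixed_activation(z)
# ReLU zeroes the first unit and passes the third through;
# tanh squashes the second and fourth.
```

So the efficiency concern is mild in practice, at least for a fixed assignment of activations to units.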

A little bit removed from your question, but the different heads in multi-head attention (in the Transformer model) are another example of getting multiple different "activation functions" into a layer.

• 1) How do dropout and other regularizations make the learned features uncorrelated? As far as I understand, these techniques force the net to learn simpler, more robust features, which reduces over-fitting. OK, add dropout and L2: the features become less wiggly, and the net learns less noise and more of the useful patterns in the data, but I cannot see how these regularized features are guaranteed to be uncorrelated. 2) Different activations: yes, tricky in languages that require vectorization (Python, R, Matlab), but no problem in those that don't (C/C++, Java, Fortran, Julia), since you can use loops.
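Regarding 1), the commenter has a point: dropout and L2 discourage co-adaptation but do not guarantee uncorrelated features. If decorrelation itself is the goal, it can be penalized directly, as in the "DeCov" idea of Cogswell et al. (2016): add a loss term on the off-diagonal entries of the hidden activations' batch covariance matrix. A rough NumPy sketch (`decov_penalty` is a hypothetical name):

```python
import numpy as np

def decov_penalty(H):
    """DeCov-style penalty: sum of squared off-diagonal entries of the
    batch covariance of hidden activations H (shape: batch x features).
    It is zero exactly when the features are uncorrelated in the batch."""
    Hc = H - H.mean(axis=0, keepdims=True)  # center each feature
    C = Hc.T @ Hc / H.shape[0]              # batch covariance matrix
    off_diag = C - np.diag(np.diag(C))      # keep only cross-covariances
    return 0.5 * np.sum(off_diag ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 1))
H_corr = np.hstack([x, 2.0 * x])       # perfectly correlated pair
H_ind = rng.normal(size=(256, 2))      # independent pair
# The penalty is large for the correlated pair and near zero otherwise.
```

In a real training loop this term would be added to the task loss with a small weight, pushing the network toward decorrelated features from the beginning rather than requiring a post-hoc pruning pass.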