Discarding correlation among inputs in a neural network


I am working on a problem with 4 inputs and 1 continuous output variable. For every sample, the 4 input values sum to 1.


So, they are correlated.

My question is: should I use all 4 variables for neural network training? Or, should I use any 3 of them to get rid of correlation? Is there any problem if I use all 4?

Saptarshi Roy

Posted 2018-05-18T07:12:19.217

Reputation: 387

How about this answer at stats.stackexchange? https://stats.stackexchange.com/questions/232534/does-correlated-input-data-lead-to-overfitting-with-neural-networks

– TwinPenguins – 2018-05-18T07:18:24.243

So, according to this answer, I can use all 4 inputs without any problem. That is ok. But do I get the same answer if I use only 3 of them? – Saptarshi Roy – 2018-05-18T07:51:14.073

Well, how are you going to choose three of the four variables? Something like PCA? If so, do try it, but you need some principled way of selecting the feature subset, or perhaps random subsampling of features. I would first use all four, watch the validation curve closely, and try some regularization first! – TwinPenguins – 2018-05-18T08:46:08.597



You could try some feature selection techniques if you want to choose 3.

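That said, have you tested the data for multicollinearity? Maybe you have, but I don't think that a1+a2+a3+a4=1 implies a high degree of pairwise correlation. What it does imply is an exact linear dependence: any one input equals 1 minus the sum of the other three.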
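One way to check this is sketched below. It is only an illustration on synthetic data: the inputs are drawn from a flat Dirichlet distribution (my assumption, chosen because its rows sum to 1), and the variable names are made up. Swap in your own array for `X`.

```python
import numpy as np

# Synthetic stand-in for the real data (assumption): 1000 samples of
# 4 inputs, each row summing to 1.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(4), size=1000)

# Pairwise Pearson correlations between the 4 input columns.
corr = np.corrcoef(X, rowvar=False)
off_diag = corr[~np.eye(4, dtype=bool)]

# Rank of the design matrix with a constant (bias) column added.
# The sum-to-1 constraint makes one column redundant, so the rank
# is lower than the number of columns.
design = np.column_stack([np.ones(len(X)), X])
rank = np.linalg.matrix_rank(design)

print("largest |pairwise correlation|:", np.abs(off_diag).max())
print("design matrix columns:", design.shape[1], "rank:", rank)
```

On data like this the pairwise correlations are moderate, while the rank deficit shows the exact linear dependence among the inputs plus bias.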

To answer your question though, it really doesn't hurt to try training a neural net with all four variables. If one of the inputs is unnecessary, the NN training may well set the weights on that input close to zero without you having to choose one.
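As for whether 3 inputs can give the same answer as 4: here is a small sketch, assuming the first layer is an ordinary dense layer (weights times inputs plus bias). Because the fourth input equals 1 minus the other three, any 4-input first layer can be rewritten as a 3-input layer with adjusted weights and bias. The sizes and names below are arbitrary choices for illustration, not anything from your setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic inputs whose rows sum to 1 (assumption), and the same
# data with the fourth column dropped.
X4 = rng.dirichlet(np.ones(4), size=5)
X3 = X4[:, :3]

# Hypothetical first dense layer on 4 inputs with 8 hidden units.
W4 = rng.normal(size=(4, 8))
b4 = rng.normal(size=8)

# Substitute x4 = 1 - x1 - x2 - x3 to get an equivalent 3-input layer:
# the fourth weight row is subtracted from the others and added to the bias.
W3 = W4[:3] - W4[3]
b3 = b4 + W4[3]

out4 = X4 @ W4 + b4
out3 = X3 @ W3 + b3

print("pre-activations match:", np.allclose(out4, out3))
```

So both versions can represent the same functions. Whether training actually lands on the same solution depends on initialization, regularization, and the optimizer, so it is still worth comparing validation curves for both.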


Posted 2018-05-18T07:12:19.217

Reputation: 898