Repeated features in Neural Networks with tabular data


When using algorithms like linear regression or other least-squares methods, having repeated or highly correlated features can harm the model. Tree-based models are generally not strongly affected by highly correlated features, and they do not suffer from the numerical stability issues that arise with least squares.

But what happens with neural networks? Most of the literature on NNs deals with images and signals, and there is not much about tabular data.

Does having repeated features in a neural network model for tabular data harm accuracy? Or are NNs able to select features on their own?

Carlos Mougan

Posted 2020-11-08T21:45:12.337

Reputation: 4 420

Answers


Strictly theoretically, it makes no difference to accuracy.

Here is why: we already know mathematically that NNs can approximate any function. So let's say that we have input X. If X is highly correlated, we can apply some decorrelation technique. The main thing is that you get X' which has a different numerical representation, most likely one that is more difficult for the NN to learn to map to the outputs y. But still, in theory, you can change the architecture, train for longer, and get the same approximation, i.e. accuracy.
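As a concrete illustration of that decorrelation step (PCA with whitening is just one possible choice here, and the data is synthetic), the transformed X' ends up with an almost-identity correlation matrix:

```python
# Illustrative only: decorrelate two nearly duplicated columns with PCA whitening.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x0 = rng.normal(size=(1000, 1))
X = np.hstack([x0, x0 + 0.01 * rng.normal(size=(1000, 1))])   # two highly correlated features

X_prime = PCA(whiten=True).fit_transform(X)                   # the X' of the argument above
print(np.corrcoef(X_prime, rowvar=False).round(3))            # close to the identity matrix
```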

Now, theory and practice are the same in theory but different in practice, and I suspect that these adjustments of architecture etc. will be much more costly in reality, depending on the dataset.

Noah Weber

Posted 2020-11-08T21:45:12.337

Reputation: 4 932

Seems reasonable. At some point the NN can simply never activate the neurons coming from that feature, but it's true that it will make the model more complex – Carlos Mougan – 2020-11-12T08:18:08.137


From experience using NNs on tabular data, having too many variables doesn't seem to directly hurt statistical performance that much. However, it has a lot of impact on memory usage, calculation time and explainability of the model. Reducing memory usage and calculation time allows you to calibrate more models (more random initialisations) and build better ensembles. In turn that allows for slightly better performance and, more importantly, for models that are more stable (i.e. performance doesn't depend on the random initialisation). Depending on the application and who is going to use the model (the data scientist or someone operational), explainability might be the main driver for feature selection. (Model stability often implies explainability stability too.)

Outside of careful Exploratory Data Analysis / a priori expert-based selection, the most practical approach for variable selection in NNs is to add regularisation to your network calibration process. Namely, the $L_1$ penalisation, by tending to shrink weights to 0, acts as feature selection. It might require some hyper-parameter tuning (calibrate multiple NNs and see which value is better). The parallel use of other regularisation techniques like drop-out generally helps the application of weight regularisation and allows for sturdier models.
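A minimal sketch of this idea in PyTorch (the layer sizes, penalty strength and loss below are illustrative assumptions, not a recipe):

```python
# Sketch: L1 penalty on the first (input) layer of a small tabular NN,
# so that weights attached to redundant features can shrink towards 0.
import torch
import torch.nn as nn

n_features = 20          # assumed number of tabular input columns
l1_lambda = 1e-3         # regularisation strength, to be tuned

model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Dropout(0.2),      # drop-out used alongside the weight penalty
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    # L1 penalty on the input layer only: acts as a soft feature selector.
    loss = loss + l1_lambda * model[0].weight.abs().sum()
    loss.backward()
    optimizer.step()
    return loss.item()
```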

There seems to be some ongoing work on pruning (removing connections / neurons) that works similarly and achieves good results. Intuitively it should work better, as it adapts the NN architecture. I am not sure those techniques are implemented in any popular library.
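For basic magnitude-based pruning at least, PyTorch does expose utilities in torch.nn.utils.prune; a minimal sketch, with the layer shape assumed:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(20, 64)                                 # assumed input layer of a tabular NN
prune.l1_unstructured(layer, name="weight", amount=0.3)   # zero out the 30% smallest weights
prune.remove(layer, "weight")                             # make the pruning permanent
```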

Another approach is to work a posteriori. With some feature-importance measure you can remove variables that weren't useful overall. You might even do that iteratively... but this requires a lot of time and work.
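A hedged sketch of that a posteriori route, using scikit-learn's permutation importance on a purely illustrative dataset with one duplicated column (the threshold and the small MLP are assumptions):

```python
# Illustrative a posteriori selection via permutation importance.
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"x{i}" for i in range(5)])
X["x4"] = X["x0"]                                   # exact duplicate of x0
y = X["x0"] + 0.5 * X["x1"] + rng.normal(scale=0.1, size=500)

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Drop features whose permutation importance is negligible.
keep = X.columns[result.importances_mean > 1e-3].tolist()
print(dict(zip(X.columns, result.importances_mean.round(3))), keep)
```

Note that with an exact duplicate the importance may be split between the two copies, which ties into the caveat in the next paragraph.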

To be honest, those approaches seem to work to remove some weights / non-informative variables locally, but I am not sure there is a guarantee that they would perfectly remove a duplicate of a meaningful feature the way a tree technique would by selecting one of them. Regarding the question of duplicated meaningful features, I tried to do some work on a posteriori importance to check if I could find them by looking at correlated importances, but got nothing really practical / generalisable to linear dependence between more than 2 variables. So the real answer to your question might be a thorough multivariate EDA to remove variables that are too correlated...
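As a starting point for that EDA, a common sketch is to drop one column from every pair whose absolute correlation exceeds some threshold; the 0.95 cut-off and the keep-the-first convention below are arbitrary assumptions:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column of every pair with |correlation| above the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```

As noted above, this only catches pairwise linear dependence, not linear combinations of more than two variables.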

For a general solution, there seems to be some ongoing work on adding variable-selection gates before the main model (see, for example, Feature Selection using Stochastic Gates), but I haven't had the occasion to test something like this yet.

lcrmorin

Posted 2020-11-08T21:45:12.337

Reputation: 1 874

Thanks for the answer. Out of curiosity, would you say that a neural network with 50 features instead of 80 is more explainable? – Carlos Mougan – 2020-11-10T11:07:46.520

Feature reduction was very important for me to get from 1000+ features to below 100 (mostly expert-based + EDA + some a posteriori feature importance). L2 regularisation (+ drop-out) seems important too, but more for model / explainability stability. Below 100 features it becomes about user interface and storytelling: you can always group your variables into 5-6 main themes and only show the top 5 in absolute impact. Combined with creating baselines by category, it seems enough for users to get a picture of what is happening. – lcrmorin – 2020-11-10T12:12:13.430