Why is duplicating inputs bad?


I am trying to predict an output value based on several continuously-valued inputs using a regression model.

I am not sure what approach is appropriate to scale/transform the input data for the regression. Let's just pretend that it is unlabeled data.

My most naive approach would be to add each input multiple times:

  • input
  • log(input)
  • sqrt(input)

and then let the regression model worry about finding which flavor of each input (if any) is significant.

What are the risks with using this approach?

William Entriken

Posted 2017-07-21T21:15:44.587

Reputation: 333



The issue with building a regression model on all three of these is that you potentially introduce multicollinearity into the model. Although log(input) and sqrt(input) are not linear functions of the input, a quick test (using Matlab) shows that the variations are still highly correlated (how strongly depends on the range of the input):

corrcoef(input, input_log) %->0.8549
corrcoef(input, input_rt) %->0.9779
corrcoef(input_log, input_rt) %->0.9407
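The Matlab lines do not show how `input` was generated, so the figures cannot be reproduced exactly. As a rough sketch in Python (standard library only), the same kind of check can be run on made-up data to see how the correlations depend on the range. The `pearson` and `transform_correlations` helpers and both ranges are assumptions chosen for illustration, not the answerer's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def transform_correlations(inputs):
    """Pairwise correlations between the raw input and its log / sqrt variants."""
    logs = [math.log(x) for x in inputs]
    roots = [math.sqrt(x) for x in inputs]
    return {
        "input vs log": pearson(inputs, logs),
        "input vs sqrt": pearson(inputs, roots),
        "log vs sqrt": pearson(logs, roots),
    }

# Hypothetical ranges (assumption): one wide, one narrow, both strictly positive.
wide = [float(i) for i in range(1, 1001)]        # 1 .. 1000
narrow = [100.0 + i * 0.1 for i in range(101)]   # 100 .. 110

for name, data in (("wide", wide), ("narrow", narrow)):
    for pair, r in transform_correlations(data).items():
        print(f"{name:6s} {pair:14s} r = {r:.4f}")
```

On a narrow range both transforms are almost linear in the input, so the correlations move very close to 1. On a wide range the curvature shows and they drop, with log(input) drifting further from the raw input than sqrt(input) does.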

Multicollinearity will not reduce the predictive power of the model, but it makes the regression coefficients of these variables difficult to interpret: small changes in the data may cause the regression model to switch which of the three input variations it finds significant. It also adds unnecessary complexity to the model. If you want to observe the effect of each variation, I would train a separate model for each one.
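One way to read the advice about training separately is to fit a single-feature model per variation and compare the fits. The sketch below does that in plain Python under loud assumptions: the data is synthetic, the target is constructed to depend on log(input), and `fit_simple` is a hypothetical helper for one-variable least squares, not part of any library.

```python
import math

def fit_simple(xs, ys):
    """Least squares for y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic data (assumption): the target is an exact function of log(input).
inputs = [float(i) for i in range(1, 201)]
target = [3.0 * math.log(x) + 1.0 for x in inputs]

variants = {
    "input": inputs,
    "log(input)": [math.log(x) for x in inputs],
    "sqrt(input)": [math.sqrt(x) for x in inputs],
}

# One separate single-feature model per variation.
results = {name: fit_simple(xs, target) for name, xs in variants.items()}
for name, (slope, intercept, r2) in results.items():
    print(f"{name:12s} slope={slope:8.4f} intercept={intercept:8.4f} R^2={r2:.4f}")

best = max(results, key=lambda name: results[name][2])
print("best single-feature fit:", best)
```

Because each model sees only one variation, its coefficient is not competing with a highly correlated copy of itself, which makes the comparison easier to interpret. With real, noisy data the gap between variations would be smaller, and the comparison should be made on held-out data.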

Additionally, applying things like sqrt and log to your output can be useful if the data currently can't be described by a linear relationship.


Posted 2017-07-21T21:15:44.587

Reputation: 503

Beautiful, thanks for adding this word to my vocabulary! – William Entriken – 2017-07-26T00:54:10.657