## Correlation as an evaluation metric for regression

I'm dealing with a regression prediction challenge where the evaluation metric is (Pearson) correlation. However, I have the impression that this metric is somewhat arbitrary: while I can keep the RMSE stable, the correlation varies greatly.

Could someone please explain this metric and how to optimise for it?
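To illustrate what I mean, here is a minimal NumPy/SciPy sketch (toy numbers, not from the actual challenge) of two prediction vectors with identical RMSE but very different Pearson correlations:

```python
import numpy as np
from scipy.stats import pearsonr

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Prediction A: constant offset -- large error, but perfect correlation
pred_a = y + 2.0

# Prediction B: alternating errors of the same magnitude -- same RMSE,
# but much lower correlation
pred_b = y + np.array([2.0, -2.0, 2.0, -2.0, 2.0])

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse(y, pred_a), pearsonr(y, pred_a)[0])  # RMSE 2.0, r = 1.0
print(rmse(y, pred_b), pearsonr(y, pred_b)[0])  # RMSE 2.0, r ≈ 0.59
```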

I'm slightly confused; if you don't think this is an appropriate metric, why are you using it? – Emre – 2016-01-24T06:37:42.213

@Emre because the challenge organizers defined it to be their evaluation metric – spore234 – 2016-01-24T09:40:34.477


The organizers might deem the direction of predicted change more important than the magnitude, i.e., it is more important that your prediction is high when the known value is high (and vice versa) than to get as close to the known value as possible. The measurements might be noisy anyway.

One fairly robust way of optimizing for it is to grid-search for a local optimum, as in this QA.
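For scikit-learn estimators, such a grid search can be sketched roughly as follows, wrapping `scipy.stats.pearsonr` as a custom scorer. The data and parameter grid here are arbitrary placeholders for illustration:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def pearson_score(y_true, y_pred):
    # Pearson r is undefined for constant predictions; guard against that
    if np.std(y_pred) == 0:
        return 0.0
    return pearsonr(y_true, y_pred)[0]

scorer = make_scorer(pearson_score, greater_is_better=True)

# Toy data: y depends on the first feature plus noise
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X[:, 0] + 0.1 * rng.randn(200)

# Placeholder grid; tune to your own problem
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, scoring=scorer, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that correlation is computed over a whole fold, not per sample, so small folds make the cross-validated score noisy.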

However, you should also note that algorithms tune their internal parameters during fitting according to some loss function. Some algorithms accept custom cost functions and their derivatives, but some implementations don't. Information-theoretic measures are standard in classification, while MSE is standard in regression.

Theoretically, you should be able to tell your Random Forest (or another algorithm, accordingly) what the optimal split is as a function of Pearson correlation.

I did a CV grid search over the hyperparameters of a random forest, and my results are very unstable. Across hyperparameter settings, correlation ranges between 0.15 and 0.4, while RMSE stays between 0.41 and 0.45. When I predict on a hold-out set, the correlation is far off the CV values: hyperparameters that perform well in the grid search perform badly on the hold-out set, and vice versa. – spore234 – 2016-01-25T07:44:18.103