Correlation as an evaluation metric for regression



I'm working on a regression prediction challenge where the evaluation metric is (Pearson) correlation. However, I have the impression that this metric is somewhat arbitrary: while the RMSE stays stable, the correlation varies widely.

Could someone please explain this metric and how to optimise for it?


Posted 2016-01-23T08:04:13.130

Reputation: 553

I'm slightly confused; if you don't think this is an appropriate metric, why are you using it? – Emre – 2016-01-24T06:37:42.213

@Emre because the challenge organizers defined it to be their evaluation metric – spore234 – 2016-01-24T09:40:34.477



The organizers might deem the direction of the predicted change more important than its magnitude, i.e., it matters more that your prediction is high when the known value is high (and vice versa) than that it is numerically as close as possible to the known value. The measurements might be noisy anyway.
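This is also why RMSE can stay stable while correlation swings around. A synthetic sketch (made-up data, not the asker's challenge): two predictions with roughly the same RMSE, one of which tracks the direction of the target and one of which is pure noise.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
y_true = rng.normal(size=200)

# Prediction A: follows y_true, but with heavy additive noise.
pred_a = y_true + rng.normal(scale=1.0, size=200)

# Prediction B: independent noise around the target's mean.
# Its RMSE is close to A's, but it carries no directional signal.
pred_b = rng.normal(scale=0.1, size=200)

def rmse(pred):
    return np.sqrt(np.mean((y_true - pred) ** 2))

print(f"A: RMSE={rmse(pred_a):.2f}, r={pearsonr(y_true, pred_a)[0]:.2f}")
print(f"B: RMSE={rmse(pred_b):.2f}, r={pearsonr(y_true, pred_b)[0]:.2f}")
```

Both predictors look similar under RMSE, yet only A scores well under Pearson correlation, which is exactly the behavior described in the question.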

One fairly robust way of optimizing for it is to grid-search for the local optimum, as in this QA.
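For a scikit-learn setup, that grid search can score candidates directly on Pearson correlation via `make_scorer`. A minimal sketch on synthetic data (the parameter grid and dataset are placeholders, not a recommendation):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def pearson_score(y_true, y_pred):
    # Pearson r is undefined for constant predictions; guard against that.
    if np.std(y_pred) == 0:
        return 0.0
    return pearsonr(y_true, y_pred)[0]

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [3, 6], "n_estimators": [50, 100]},
    scoring=make_scorer(pearson_score),  # maximize correlation, not MSE
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that this only selects hyperparameters by correlation; each forest is still fitted internally with its own squared-error criterion, which is the caveat raised below.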

However, you should also note that algorithms tweak their internal parameters during fitting according to some loss function. Some algorithms accept custom cost functions and derivatives, but some implementations don't. Information-theoretic measures are standard in classification, while MSE is standard in regression.

Theoretically, you should be able to tell your Random Forest (or another algorithm, accordingly) what the optimal split is as a function of Pearson correlation.



Reputation: 2 982

I did a CV grid search over the hyperparameters of a random forest and my results are very unstable. Correlation ranges between 0.15 and 0.4, while RMSE stays between 0.41 and 0.45 across the hyperparameters. When I predict on a hold-out set, the correlation is far off the CV values. Hyperparameters that perform well in the grid search perform badly on the hold-out set and vice versa. – spore234 – 2016-01-25T07:44:18.103

Indeed, please see the updated answer. – K3---rnc – 2016-01-25T17:40:20.333

The problem is that regression minimizes a cost function (presumably least squares in your case), and that cost function is not based on the correlation; it also includes fitting the y-intercept. Your grid search is effectively interfering with your cost-function minimization. If you are coding this yourself, I suggest choosing the Pearson correlation as the cost function within the algorithm. That way the y-intercept (w0, or bias weight) won't be fit, which will reduce the bias of the problem. – AN6U5 – 2016-01-26T14:40:21.640