## Generate predictions that are orthogonal (uncorrelated) to a given variable

8

3

I have an X matrix, a y variable, and another variable ORTHO_VAR. I need to predict the y variable using X, however, the predictions from that model need to be orthogonal to ORTHO_VAR while being as correlated with y as possible.

I would prefer that the predictions are generated with a non-parametric method such as xgboost.XGBRegressor but I could use a linear method if absolutely necessary.

This code:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
TARGET = 'target'
PRED = 'yhat'

# Create regression dataset with two correlated targets
X, y = make_regression(n_features=20, random_state=245, n_targets=2)
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df[PRED] = xgb.predict(df[indep_vars])

# Correlation should be low or preferably zero
pred_corr_w_ortho = df.corr().abs()[PRED][ORTHO_VAR]
assert pred_corr_w_ortho < 0.01, "Correlation score: {0} is superior to the given threshold.".format(pred_corr_w_ortho)


Returns this:

---------------------------------------------------------------------------
AssertionError
1 pred_corr_w_ortho = df.corr().abs()[PRED][ORTHO_VAR]
----> 2 assert pred_corr_w_ortho < 0.01, "Correlation score: {0} is superior to the given threshold.".format(pred_corr_w_ortho)

AssertionError: Correlation score: 0.5895885756753665 is superior to the given threshold.


...and I would like something that maintains as much predictive accuracy as possible while remaining orthogonal to ORTHO_VAR

2What is the correlation of TARGET with ORTHO_VAR? – Esmailian – 2019-04-15T04:26:42.873

Good question. They are indeed correlated (let’s say 50%) the predictions will likely suffer in terms of accuracy by being made orthogonal. – Chris – 2019-04-15T10:39:55.087

5

This requirement can be satisfied by adding sufficient noise to predictions $$\hat{y}$$ to decorrelate them from orthogonal values $$v$$. Ideally, if $$\hat{y}$$ is already decorrelated from $$v$$, no noise would be added to $$\hat{y}$$, thus $$\hat{y}$$ would be maximally correlated with $$y$$.

Mathematically, we want to create $$\hat{y}'=\hat{y}+\epsilon$$ from $$\epsilon \sim \mathcal{N}(0, \sigma_{\epsilon})$$, to satisfy $$r_{\hat{y}'v} = \frac{\sigma_{\hat{y}'v}}{\sigma_{\hat{y}'}\sigma_{v}} < \delta$$ for arbitrary threshold $$\delta$$. Now, lets expand this inequality to find a lower-bound for std of noise $$\epsilon$$, i.e. $$\sigma_{\epsilon}$$. \begin{align*} \sigma_{\hat{y}'}^2&=\sigma_{\hat{y}}^2 + \sigma_{\epsilon}^2,\\ \sigma_{\hat{y}'v}&={\Bbb E}\left[(\hat{y}+\epsilon - \mu_{\hat{y}} - \overbrace{\mu_{\epsilon}}^{=0})(v-\mu_{v})\right]\\ &={\Bbb E}\left[(\hat{y} - \mu_{\hat{y}})(v-\mu_{v})\right]+\overbrace{{\Bbb E}\left[\epsilon(v-\mu_{v})\right]}^{=0}&\\ &=\sigma_{\hat{y}v},\\ r_{\hat{y}'v} &= \frac{\sigma_{\hat{y}'v}}{\sigma_{\hat{y}'}\sigma_{v}} =\frac{\sigma_{\hat{y}v}}{\sigma_{v} \sqrt{\sigma_{\hat{y}}^2+\sigma_{\epsilon}^2}} < \delta\\ &\Rightarrow \sigma_{\hat{y}}\sqrt{\left(\frac{r_{\hat{y}v}}{\delta}\right)^2 - 1} < \sigma_{\epsilon} \end{align*}

Since all the variables in the left side of inequality can be calculated, we can sample noises from $$\mathcal{N}(0, \sigma_{\epsilon})$$ and add them to $$\hat{y}$$ to satisfy the original inequality.

Here is a code that does the exact same thing:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

ORTHO_VAR = 'ortho_var'
IND_VARNM = 'indep_var'
TARGET = 'target'
CORRECTED_VARNM = 'indep_var_fixed'

seed = 245
# Create regression dataset with two correlated targets
X, y = make_regression(n_samples=10000, n_features=20, random_state=seed, n_targets=2)
indep_vars = ['var{}'.format(i) for i in range(X.shape[1])]

# Pull into dataframe
df = pd.DataFrame(X, columns=indep_vars)
df[TARGET] = y[:, 0]
df[ORTHO_VAR] = y[:, 1]

# Fit a model to predict TARGET
xgb = XGBRegressor(n_estimators=10)
xgb.fit(df[indep_vars], df[TARGET])
df['yhat'] = xgb.predict(df[indep_vars])

delta = 0.01

# std of noise required to be added to y_hat to bring the correlation
# of y_hat with ORTHO_VAR below delta
std_y_hat = np.std(df['yhat'], ddof=1)
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
std_noise_lower_bound = std_y_hat * np.sqrt((corr_y_hat_ortho_var / delta)**2 - 1.0)
std_noise = max(0, std_noise_lower_bound) + 1
print('delta: ', delta)
print('std_y_hat: ', std_y_hat)
print('corr_y_hat_target: ', corr_y_hat_target)
print('corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
print('std_noise_lower_bound: ', std_noise_lower_bound)
print('std_noise: ', std_noise)

np.random.seed(seed)
noises = np.random.normal(0, std_noise, len(df['yhat']))
noises -= np.mean(noises)  # remove slight deviations from zero mean
print('noise_samples: mean:', np.mean(noises), ', std: ', np.std(noises))
df['yhat'] = df['yhat'] + noises

# measure new correlation
corr_y_hat_ortho_var = np.corrcoef(df['yhat'], df[ORTHO_VAR])[1, 0]
corr_y_hat_target = np.corrcoef(df['yhat'], df[TARGET])[1, 0]
print('new corr_y_hat_target: ', corr_y_hat_target)
print('new corr_y_hat_ortho_var: ', corr_y_hat_ortho_var)
# Correlation should be low or preferably zero
assert corr_y_hat_ortho_var < delta, corr_y_hat_ortho_var
assert -delta < corr_y_hat_ortho_var, corr_y_hat_ortho_var


which outputs:

delta:  0.01
std_y_hat:  69.48568725585938
corr_y_hat_target:  0.8207672834857673
corr_y_hat_ortho_var:  0.7663936356880843
std_noise_lower_bound:  5324.885500165032
std_noise:  5325.885500165032
noise_samples: mean: 1.1059455573558807e-13 , std:  5373.914830034988
new corr_y_hat_target:  -0.004125016071865934
new corr_y_hat_ortho_var:  -0.000541131379457552


You can experiment with other deltas. By comparing std_y_hat with std_noise_lower_bound, you can see that a huge noise must be added to $$\hat{y}$$ to decorrelate it from $$v$$ bellow $$0.01$$, which dramatically decolerates $$\hat{y}$$ from $$y$$ too.

Note: Assertion might fail for too small thresholds $$\delta$$ due to insufficient sample count.

Is the lack of correlation between yhat and the target variable related to the high correlation between the two target variables (e.g. my fault)? Ideally we would want new corr_y_hat_target to be as high as possible with new corr_y_hat_ortho_var to be as low as possible – Chris – 2019-04-15T13:14:24.587

@Chris Indirectly yes. If target had a low correlation with orth, y_hat (which has a high correlation with target) would also had a low correlation with orth. As a result, a low noise would have been added to y_hat and its correlation with target would have changed slightly. – Esmailian – 2019-04-15T13:20:09.040