I'm not sure what your boss thinks "more predictive" means. Many people *incorrectly* believe that lower $p$-values mean a better / more predictive model. **That is not necessarily true** (this being a case in point). However, independently sorting both variables beforehand will guarantee a lower $p$-value. On the other hand, we can assess the predictive accuracy of a model by comparing its predictions to new data generated by the same process. I do that below in a simple example (coded in `R`).
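As a quick side check (not part of the main example below), the claim that independently sorting both variables can only inflate the correlation follows from the rearrangement inequality: sorting leaves each variable's mean and standard deviation unchanged, but pairing the values in matching order maximizes the sum of cross-products. A small simulation makes this concrete:

```r
# Sorting each variable independently never decreases the correlation,
# because the means and SDs are unchanged while the sum of cross-products
# is maximized when both variables are in the same order.
set.seed(1)
sorted.not.smaller <- replicate(1000, {
  x <- rnorm(20)
  y <- rnorm(20)
  cor(sort(x), sort(y)) >= cor(x, y)
})
mean(sorted.not.smaller)  # [1] 1
```
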

```
options(digits=3) # for cleaner output
set.seed(9149) # this makes the example exactly reproducible
B1 = .3
N = 50 # 50 data
x = rnorm(N, mean=0, sd=1) # standard normal X
y = 0 + B1*x + rnorm(N, mean=0, sd=1) # cor(x, y) = .31
sx = sort(x) # sorted independently
sy = sort(y)
cor(x,y) # [1] 0.309
cor(sx,sy) # [1] 0.993
model.u = lm(y~x)
model.s = lm(sy~sx)
summary(model.u)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.021 0.139 0.151 0.881
# x 0.340 0.151 2.251 0.029 # significant
summary(model.s)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.162 0.0168 9.68 7.37e-13
# sx 1.094 0.0183 59.86 9.31e-47 # wildly significant
u.error = vector(length=N) # these will hold the output
s.error = vector(length=N)
for(i in 1:N){
  new.x = rnorm(1, mean=0, sd=1) # new data generated in exactly the same way
  new.y = 0 + B1*new.x + rnorm(1, mean=0, sd=1)
  pred.u = predict(model.u, newdata=data.frame(x=new.x))
  pred.s = predict(model.s, newdata=data.frame(sx=new.x))
  u.error[i] = abs(pred.u-new.y) # these are the absolute values of
  s.error[i] = abs(pred.s-new.y) # the predictive errors
}; rm(i, new.x, new.y, pred.u, pred.s)
u.s = u.error-s.error # negative values mean the original data
                      # yielded more accurate predictions
mean(u.error) # [1] 1.1
mean(s.error) # [1] 1.98
mean(u.s<0) # [1] 0.68
windows() # opens a plotting window on Windows; use dev.new() on other platforms
layout(matrix(1:4, nrow=2, byrow=TRUE))
plot(x, y, main="Original data")
abline(model.u, col="blue")
plot(sx, sy, main="Sorted data")
abline(model.s, col="red")
h.u = hist(u.error, breaks=10, plot=FALSE)
h.s = hist(s.error, breaks=9, plot=FALSE)
plot(h.u, xlim=c(0,5), ylim=c(0,11), main="Histogram of prediction errors",
     xlab="Magnitude of prediction error", col=rgb(0,0,1,1/2))
plot(h.s, col=rgb(1,0,0,1/4), add=TRUE)
legend("topright", legend=c("original","sorted"), pch=15,
       col=c(rgb(0,0,1,1/2),rgb(1,0,0,1/4)))
dotchart(u.s, color=ifelse(u.s<0, "blue", "red"), lcolor="white",
         main="Difference between predictive errors")
abline(v=0, col="gray")
legend("topright", legend=c("u better", "s better"), pch=1, col=c("blue","red"))
```

The upper left plot shows the original data. There is some relationship between $x$ and $y$ (viz., the correlation is about $.31$). The upper right plot shows what the data look like after independently sorting both variables. You can easily see that the strength of the correlation has increased substantially (it is now about $.99$). However, in the lower plots, we see that the distribution of predictive errors is much closer to $0$ for the model trained on the original (unsorted) data. The mean absolute predictive error for the model that used the original data is $1.1$, whereas the mean absolute predictive error for the model trained on the sorted data is $1.98$, nearly twice as large. That means the sorted-data model's predictions are much further from the correct values. The plot in the lower right quadrant is a dot plot. It displays the difference between the predictive error with the original data and the predictive error with the sorted data, so you can compare the two corresponding predictions for each simulated new observation. Blue dots to the left of $0$ mark cases where the original data yielded the prediction closer to the new $y$-value, and red dots to the right mark cases where the sorted data yielded the better prediction. The model trained on the original data made the more accurate prediction $68\%$ of the time.

The degree to which sorting will cause these problems is a function of the linear relationship that exists in your data. If the correlation between $x$ and $y$ were $1.0$ already, sorting would have no effect and thus not be detrimental. On the other hand, if the correlation were $-1.0$, the sorting would completely reverse the relationship, making the model as inaccurate as possible. If the data were completely uncorrelated originally, the sorting would have an intermediate, but still quite large, deleterious effect on the resulting model's predictive accuracy. Since you mention that your data are typically correlated, I suspect that has provided some protection against the harms intrinsic to this procedure. Nonetheless, sorting first is definitely harmful. To explore these possibilities, we can simply re-run the above code with different values for `B1` (using the same seed for reproducibility) and examine the output:

`B1 = -5`:

```
cor(x,y) # [1] -0.978
summary(model.u)$coefficients[2,4] # [1] 1.6e-34 # (i.e., the p-value)
summary(model.s)$coefficients[2,4] # [1] 1.82e-42
mean(u.error) # [1] 7.27
mean(s.error) # [1] 15.4
mean(u.s<0) # [1] 0.98
```

`B1 = 0`:

```
cor(x,y) # [1] 0.0385
summary(model.u)$coefficients[2,4] # [1] 0.791
summary(model.s)$coefficients[2,4] # [1] 4.42e-36
mean(u.error) # [1] 0.908
mean(s.error) # [1] 2.12
mean(u.s<0) # [1] 0.82
```

`B1 = 5`:

```
cor(x,y) # [1] 0.979
summary(model.u)$coefficients[2,4] # [1] 7.62e-35
summary(model.s)$coefficients[2,4] # [1] 3e-49
mean(u.error) # [1] 7.55
mean(s.error) # [1] 6.33
mean(u.s<0) # [1] 0.44
```
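The two boundary cases are easy to verify directly, without any regression at all. With a noiseless, perfectly positively related pair, sorting changes nothing; with a perfectly negatively related pair, sorting flips the relationship entirely. A minimal illustration (not part of the original example above):

```r
# Boundary cases with no noise, showing why the damage depends on the
# sign and strength of the true relationship.
x     <- 1:10
y.pos <-  2*x   # cor(x, y.pos) =  1
y.neg <- -2*x   # cor(x, y.neg) = -1

identical(sort(y.pos), y.pos)   # TRUE: sorting a perfectly positively
                                # correlated pair changes nothing
all(sort(y.neg) == rev(y.neg))  # TRUE: sorting reverses the pairing
cor(sort(x), sort(y.neg))       # 1: the true relationship (-1) is flipped
```
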
