## Non-linear regression line fit

3

2

I performed a regression analysis with two datasets, each of size 50. One dataset is called Spatial % and the other Min values, and I wanted to check whether the two are correlated. I did the analysis in SPSS and plotted the result as a scatterplot. I am not very experienced, but it seems to me that a straight line is not a good fit for this scatterplot. Would a power curve fit better? If not, what else would you suggest?

Did any of the existing answers help you solve the question? If yes, then please mark it as accepted :) – Dawny33 – 2015-11-19T17:45:29.793

You say "two different datasets" but then plot them as if they are one dataset with two features. Is that what you really want? Also, you say "want to find the correlation", but then never check the linear correlation. Do you want to check the linear correlation before moving to nonlinear? Though it might not "look" like a line is a good fit, there are plenty of cases where points get stacked on top of one another, so the distribution is linear and it's only the outliers that are nonlinear. Lots of posters seem to be jumping to nonlinear without fully vetting linear. I advise being disciplined. – AN6U5 – 2015-11-19T20:39:33.353

The dots you see have their x values coming from one dataset and the y values coming from another. How would you check if there is a regression line that could be fit, if you don't plot them this way? – FaCoffee – 2015-11-20T08:56:04.910

As for the linear correlation: I failed to imagine how a linear regression could even fit this scatterplot. Look at its shape: it suggests some exponential curve fits them best. This is also confirmed by the two answers that I got. You are right, there could be a partial linear regression which may serve the cause, but look at those that would become outliers: aren't they too many to "get rid" of them? – FaCoffee – 2015-11-20T08:58:02.487

I have tried to export the graph values, with limited success. Could you please share the data? – Laurent Duval – 2015-11-29T08:32:00.860

2

I tried to estimate a few of your data values from the scatter plot you provided.

Then I performed a power model regression (strictly speaking, the fitted form has $x$ in the exponent, so it is an exponential model) and came up with

$y = (5.777 \cdot 10^{-16}) \cdot 1.404^{x}.$

The estimated values I used are below.

$(70, 0.01), (75, 0.012), (80, 0.015), (90, 0.025), (95, 0.075), (98, 0.15), (99, 0.20), (99.5, 0.25), (99.9, 0.32)$

Of course, your actual model will differ because you have the actual data set. I just eyeballed a few points so I could test the power fit.
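As a rough illustration of fitting a curve of the form $y = a \cdot b^{x}$ to the eyeballed points, here is a minimal Python sketch. It assumes ordinary least squares on $(x, \ln y)$, which is only one possible fitting criterion and is not necessarily how the coefficients quoted above were obtained, so its output may differ from them. The helper name `fit_exponential` is hypothetical.

```python
import math

# Points eyeballed from the scatterplot (copied from the list above);
# they are rough estimates, not the asker's actual data.
points = [(70, 0.01), (75, 0.012), (80, 0.015), (90, 0.025), (95, 0.075),
          (98, 0.15), (99, 0.20), (99.5, 0.25), (99.9, 0.32)]


def fit_exponential(pts):
    """Fit y = a * b**x by ordinary least squares on (x, ln y).

    Assumes every y is strictly positive, otherwise ln(y) is undefined.
    """
    xs = [x for x, _ in pts]
    logs = [math.log(y) for _, y in pts]
    n = len(pts)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    slope = sxl / sxx                    # equals ln(b)
    intercept = mean_l - slope * mean_x  # equals ln(a)
    return math.exp(intercept), math.exp(slope)


a, b = fit_exponential(points)
print(f"y = {a:.3e} * {b:.4f}**x")
```

Fitting in log space gives more relative weight to the small $y$ values than a direct least-squares fit on $y$ would, which is one reason two "exponential fits" of the same points can disagree.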

1

You're right, a basic linear regression is unlikely to fit this data. You need some form of non-linear regression.

Basic forms of this (including exponential regression, mentioned by @Dawny33) can be found in most spreadsheet software, including Excel. Packages like scikit-learn and others allow for more flexibility.
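One hedged way to check how badly a straight line fits before choosing a non-linear model is to compare $R^2$ for a line through $(x, y)$ with $R^2$ for a line through $(x, \ln y)$. The sketch below uses only the Python standard library and the points eyeballed in another answer, which are rough estimates rather than the real data; the helper name `r_squared` is hypothetical.

```python
import math

# Rough points eyeballed from the scatterplot in another answer (an assumption,
# not the asker's actual 50 observations).
points = [(70, 0.01), (75, 0.012), (80, 0.015), (90, 0.025), (95, 0.075),
          (98, 0.15), (99, 0.20), (99.5, 0.25), (99.9, 0.32)]


def r_squared(xs, ys):
    """R^2 of the least-squares straight line through (xs, ys).

    For simple linear regression this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


xs = [x for x, _ in points]
ys = [y for _, y in points]
log_ys = [math.log(y) for y in ys]

print("straight line on y:     R^2 =", round(r_squared(xs, ys), 3))
print("straight line on ln(y): R^2 =", round(r_squared(xs, log_ys), 3))
```

A clearly higher value on the log scale would support an exponential-type model; similar values would suggest that neither form captures the shape well.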

1

Yes, linear regression is not a nice fit for this problem.

Non-linear regression, as @jamesmf suggested, can be a nice option.

More specifically, this looks like a nice fit for exponential regression.

The graph of an exponential regression looks something like this (plot omitted).

The Box-Cox transformation can also be used for fitting the plot.

I have taken a sample data set and applied a Box-Cox transformation, with parameters chosen so that the result looks somewhat like the plot of your data (plot omitted).

Sorry for the noise, as it is a quick and dirty implementation. Still, a Box-Cox transformation should also be a nice way to fit.

R code for the above plot:

    library(forecast)  # provides BoxCox()
    library(fma)       # assumed source of the elec sample data set
    lambda <- 9.6
    plot(BoxCox(elec, lambda))


elec is a sample data set.
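For readers who want to see what the transformation does, here is a minimal Python sketch of the standard one-parameter Box-Cox definition, $(y^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $\ln y$ for $\lambda = 0$. The function name `box_cox` and the sample values are illustrative assumptions, not part of the R code above.

```python
import math


def box_cox(values, lam):
    """Apply the one-parameter Box-Cox transform to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The lambda -> 0 limit of (v**lam - 1) / lam is the natural log.
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Illustrative values only, chosen to be small and positive.
sample = [0.01, 0.025, 0.15, 0.32]
print(box_cox(sample, 0))    # log transform
print(box_cox(sample, 0.5))  # square-root-like transform
```

Values of $\lambda$ below 1 compress large values, while values above 1 stretch them, so the choice of $\lambda$ controls how strongly the curve is bent.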

Can the Box-Cox Transformation be categorized as an exponential transformation? – FaCoffee – 2015-11-12T14:47:18.610

@FrancescoCastellani Yes, Box-Cox is also an option. I have edited my answer to include how it fits in, along with a quick implementation in R. (Sorry, I'm not experienced with SPSS) – Dawny33 – 2015-11-12T15:07:32.357

1

I suspect that your $x$ values, as they are percentages, lie in $[0,100[$. The $y$ values seem positive, but many are very close to $0$. So I would first decide whether $y$s below a threshold should be set aside as outliers, as they will have a huge influence on the first basic fits. You can reintroduce them afterward with robust fitting procedures.

An important question is: are the $y$s bounded or not? The slope seems very steep, so you have to guess whether the derivative is infinite at $x=100$, to help you choose models.

I believe a first idea is to perform a change of variable on the $x$ axis, with $x' = \frac{1}{{(100-x)}^\alpha}$ and try some $\alpha$ values, to see if clearer patterns appear.
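A minimal Python sketch of this scan over $\alpha$ follows. It assumes the rough points eyeballed in another answer (not the real data), an arbitrary grid of $\alpha$ values, and the Pearson correlation between $x'$ and $y$ as a crude indicator of how linear the transformed relationship is; the helper names `transform` and `pearson` are hypothetical.

```python
import math

# Rough points eyeballed from the scatterplot in another answer (an assumption).
points = [(70, 0.01), (75, 0.012), (80, 0.015), (90, 0.025), (95, 0.075),
          (98, 0.15), (99, 0.20), (99.5, 0.25), (99.9, 0.32)]


def transform(x, alpha):
    """Change of variable x' = 1 / (100 - x)**alpha, valid for x < 100."""
    return 1.0 / (100.0 - x) ** alpha


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


xs = [x for x, _ in points]
ys = [y for _, y in points]

# The alpha grid is arbitrary; widen or refine it as needed.
for alpha in (0.25, 0.5, 1.0, 2.0):
    x_prime = [transform(x, alpha) for x in xs]
    print(f"alpha = {alpha}: correlation(x', y) = {pearson(x_prime, ys):.3f}")
```

A correlation close to 1 for some $\alpha$ would suggest that $y$ is roughly linear in $x'$, which can then be checked visually by plotting $y$ against $x'$.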