You might want to interpret your coefficients. That is, to be able to say things like "if I increase my variable $X_1$ by 1, then, on average and all else being equal, $Y$ should increase by $\beta_1$".

For your coefficients to be interpretable, linear regression assumes a bunch of things.

One of these things is no multicollinearity. That is, your $X$ variables should not be correlated with each other.
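A quick way to spot multicollinearity is to look at the pairwise correlations between predictors. Below is a minimal sketch on simulated data (the variables `x1`, `x2`, `x3` and the 0.99 threshold are illustrative, not from any standard recipe):

```python
import numpy as np

# Toy design matrix: x2 is almost a copy of x1, so the two are collinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # nearly perfectly correlated with x1
x3 = rng.normal(size=200)                   # independent predictor
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between predictors; values near +/-1 signal trouble.
corr = np.corrcoef(X, rowvar=False)
print(corr.round(2))
```

In practice people often compute variance inflation factors instead, but the correlation matrix already catches the worst cases.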

Another is homoscedasticity. The errors your model makes should have the same variance; that is, the linear regression should not make small errors for low values of $X$ and large errors for high values of $X$. In other words, the spread of the difference between your predictions $\hat Y$ and the true values $Y$ should be constant. One way to encourage this is to make sure that $Y$ follows a Gaussian distribution. (The proof is highly mathematical.)
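To see what a violation looks like, here is a small sketch on simulated data where the noise grows with $X$ (the data, the split at $x = 5$, and the use of `np.polyfit` for the fit are all illustrative choices):

```python
import numpy as np

# Simulated heteroscedastic data: the noise scale grows with x,
# violating the constant-variance assumption.
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 500)
y = 2 * x + rng.normal(scale=x, size=500)  # error spread proportional to x

# Fit an ordinary least-squares line and inspect the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Compare residual variance for low vs. high x:
# a large gap flags heteroscedasticity.
low = residuals[x < 5].var()
high = residuals[x >= 5].var()
print(f"var(low x) = {low:.2f}, var(high x) = {high:.2f}")
```

A residuals-vs-fitted scatter plot shows the same thing visually: a funnel shape means the variance is not constant.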

Depending on your data, you may be able to make it Gaussian. Typical transformations are the inverse, the logarithm, or the square root; many others exist, and the right one depends on your data. Look at your data first: plot a histogram, or run a normality test such as the Shapiro-Wilk test.
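For example, a log transform often works on right-skewed targets. This sketch uses `scipy.stats.shapiro` on simulated log-normal data (the data itself is illustrative; a log-normal variable becomes exactly Gaussian after taking the log):

```python
import numpy as np
from scipy import stats

# Simulated right-skewed target (log-normal), common for prices, incomes, etc.
rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Shapiro-Wilk test: a small p-value means we reject normality.
_, p_raw = stats.shapiro(y)
_, p_log = stats.shapiro(np.log(y))  # the log should look far more Gaussian

print(f"p-value raw: {p_raw:.4g}, p-value after log: {p_log:.4g}")
```

On real data the improvement is rarely this clean, which is why looking at a histogram before and after the transform is still worthwhile.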

These are all techniques for building an unbiased estimator. I don't think it has anything to do with convergence, as others have said (sometimes you may also want to normalize your data, but that is a different topic).

Following the linear regression assumptions is important if you want to interpret the coefficients or use statistical tests on your model. Otherwise, forget about it.

Applying the logarithm or normalizing your data is also important because linear regression optimization algorithms typically minimize $\|\hat y - y\|^2$, so if you have some big $y$ outliers, your estimator is going to be VERY concerned about minimizing those, since it penalizes the squared error, not the absolute error. Normalizing your data is important in those cases, and this is why scikit-learn has a `normalize` option in the `LinearRegression` constructor.
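Note that recent scikit-learn versions have removed the `normalize` argument from `LinearRegression`; the equivalent today is to standardize the features in a pipeline. A minimal sketch on toy data (the feature scales and coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one feature on a much larger scale than the other.
rng = np.random.default_rng(7)
X = np.column_stack([rng.normal(size=100), rng.normal(scale=1000, size=100)])
y = X @ np.array([3.0, 0.002]) + rng.normal(size=100)

# Standardize the features before fitting: the modern replacement
# for the old normalize=True option.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)
print(model.predict(X[:3]))
```

Standardization does not change the fitted predictions of ordinary least squares, but it puts the coefficients on a comparable scale and helps iterative solvers converge.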


So residuals are Gaussian (and can be canceled out by averaging), the variance is stable, and to precondition the optimizer to expedite convergence. https://en.wikipedia.org/wiki/Power_transform – Emre – 2017-07-07T17:52:22.817

`skewed = skewed[skewed > 0.75]` `skewed = skewed.index` What is the description of index? What is the interpretation of 0.75? What difference do you see between standard deviation and skewness? – Subhash C. Davar – 2020-08-05T09:23:36.860

It's not my code. `skewed.index` gives you the indices of the slice > 0.75. The indices are nothing but row numbers, and that's how you know which ones you want to preprocess. There was no explanation given in the link above, so I thought maybe someone would shed some light on why that particular threshold and why log1p. I'm sure you'll figure it out too. – Abhijay Ghildyal – 2020-08-06T06:15:54.260