## Why do we convert skewed data into a normal distribution


I was going through a solution to the Housing Prices competition on Kaggle (Human Analog's kernel on House Prices: Advanced Regression Techniques) and came across this part:

```python
# Transform the skewed numeric features by taking log(feature + 1).
# This will make the features more normal.
import numpy as np
from scipy.stats import skew

skewed = train_df_munged[numeric_features].apply(
    lambda x: skew(x.dropna().astype(float)))
skewed = skewed[skewed > 0.75]
skewed = skewed.index

train_df_munged[skewed] = np.log1p(train_df_munged[skewed])
test_df_munged[skewed] = np.log1p(test_df_munged[skewed])
```


I am not sure why a skewed distribution needs to be converted into a normal one. Could someone please explain in detail:

1. Why is this being done here? How is it helpful?
2. How is this different from feature-scaling?
3. Is this a necessary step for feature-engineering? What is likely to happen if I skip this step?


So that the residuals are Gaussian (and can be canceled out by averaging), the variance is stable, and the optimizer is preconditioned to expedite convergence. https://en.wikipedia.org/wiki/Power_transform

– Emre – 2017-07-07T17:52:22.817

skewed = skewed[skewed > 0.75]; skewed = skewed.index. What is the meaning of index? What is the interpretation of 0.75? What difference do you see between standard deviation and skewness? – Subhash C. Davar – 2020-08-05T09:23:36.860

It's not my code. skewed.index gives you the indices of the slice >0.75 — here those indices are the feature (column) names, and that's how you know which ones you want to preprocess. There was no explanation given in the link above, so I thought maybe someone would shed some light on why that particular threshold and why log1p. I'm sure you'll figure it out too. – Abhijay Ghildyal – 2020-08-06T06:15:54.260

19

You might want to interpret your coefficients. That is, to be able to say things like "if I increase my variable $X_1$ by 1, then, on average and all else being equal, $Y$ should increase by $\beta_1$".

For your coefficients to be interpretable, linear regression assumes a bunch of things.

One of these things is no multicollinearity. That is, your $X$ variables should not be correlated with each other.

Another is homoscedasticity. The errors your model makes should have the same variance, i.e. you should ensure the linear regression does not make small errors for low values of $X$ and big errors for higher values of $X$. In other words, the spread of the differences between your predictions $\hat Y$ and the true values $Y$ should be constant. You can help ensure that by making sure that $Y$ follows a Gaussian distribution. (The proof is highly mathematical.)
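As a rough illustration on made-up data (not the housing data): with multiplicative noise, the residuals of a straight-line fit fan out as $X$ grows, while fitting on $\log Y$ makes their spread roughly constant.

```python
# Sketch with synthetic data: multiplicative noise makes residual spread
# grow with x; taking the log makes it roughly constant (homoscedastic).
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 200)
y = np.exp(0.3 * x + rng.normal(0, 0.2, size=200))  # multiplicative noise

# Residuals of a straight-line fit to raw y: spread grows with x.
res = y - np.polyval(np.polyfit(x, y, 1), x)
ratio_raw = np.std(res[x >= 5]) / np.std(res[x < 5])

# After log, the model is linear with additive noise: spread is stable.
res_log = np.log(y) - np.polyval(np.polyfit(x, np.log(y), 1), x)
ratio_log = np.std(res_log[x >= 5]) / np.std(res_log[x < 5])

print(ratio_raw, ratio_log)  # ratio_raw is much larger than ratio_log
```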

Depending on your data, you may be able to make it Gaussian. Typical transformations are taking the inverse, the logarithm, or the square root. Many others exist, of course; it all depends on your data. You have to look at your data, and then plot a histogram or run a normality test, such as the Shapiro-Wilk test.
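A minimal sketch of that workflow, on made-up lognormal data (the numbers are illustrative, not from the kernel):

```python
# Check skewness, apply a log transform, and re-test normality.
import numpy as np
from scipy.stats import skew, shapiro

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # right-skewed sample

print("skew before:", skew(y))  # strongly positive
stat, p = shapiro(y)
print("Shapiro-Wilk p before:", p)  # tiny: normality is rejected

y_log = np.log(y)  # the log of a lognormal sample is Gaussian
print("skew after:", skew(y_log))  # close to zero
stat, p = shapiro(y_log)
print("Shapiro-Wilk p after:", p)
```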

These are all techniques to build an unbiased estimator. I don't think it has anything to do with convergence as others have said (sometimes you may also want to normalize your data, but that is a different topic).

Following the linear regression assumptions is important if you want to either interpret the coefficients or if you want to use statistical tests in your model. Otherwise, forget about it.

Applying the logarithm or normalizing your data is also important because linear regression optimization algorithms typically minimize $\|\hat y - y\|^2$, so if you have some big $y$ outliers, your estimator is going to be VERY concerned about minimizing those, since it is concerned about the squared error, not the absolute error. Normalizing your data is important in those cases, and this is why scikit-learn has a normalize option in the LinearRegression constructor.
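To see that outlier effect concretely, here is a toy sketch (synthetic numbers, not the housing data): a single inflated $y$ value drags the least-squares slope well away from the truth, precisely because its squared error dominates the loss.

```python
# One big y outlier dominates the squared loss and drags the OLS slope.
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y_clean = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=20)  # true slope = 2

y_outlier = y_clean.copy()
y_outlier[-1] += 200.0  # a single huge outlier at the last point

slope_clean = np.polyfit(x, y_clean, 1)[0]  # close to 2
slope_out = np.polyfit(x, y_outlier, 1)[0]  # pulled far above 2
print(slope_clean, slope_out)
```

A log transform (or a robust loss) shrinks such extreme values, so they contribute far less to the squared error.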


The skewed data here is being normalised by adding one (so that zeros become one, since the log of 0 is undefined) and taking the natural log. The data can be made nearly normal using transformation techniques like taking the square root, the reciprocal, or the logarithm. Now, why is it required? Many algorithms in data science assume that the data is normal and compute various statistics under this assumption. So the closer the data is to normal, the better it fits that assumption.
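A small illustration (toy numbers) of why `log1p` is used rather than plain `log`, and of the skewness shrinking:

```python
# np.log1p computes log(x + 1), so zeros stay finite (log(0) is -inf).
import numpy as np
from scipy.stats import skew

x = np.array([0.0, 1.0, 3.0, 7.0, 20.0, 400.0])  # right-skewed, has a zero

print(np.log1p(x))                 # log1p(0) == 0, no -inf
print(skew(x), skew(np.log1p(x)))  # skewness drops after the transform
```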

The algorithms here are gradient boosting and lasso regression. I think this answer would be more helpful if it could show how it is (or is not) relevant to these two algorithms specifically. – oW_ – 2017-07-07T16:58:41.250

From my point of view, when a model is trained, whether it is a linear regression or a decision tree (robust to outliers), skewed data makes it difficult for the model to find a proper pattern in the data; that is why we have to transform skewed data into a normal or Gaussian one. – Goldi Rana – 2019-10-29T08:44:49.913


Because data science is just statistics at the end of the day, and one of the key assumptions of statistics is the Central Limit Theorem. So this step is being done because some subsequent step uses stats techniques that rely on it.

A theorem is not an assumption. The Central Limit Theorem in fact guarantees that the average of independent random variables is approximately normally distributed even when the individual random variables are not normally distributed. – Elias Strehle – 2018-02-20T09:24:02.220

This is one extremely flawed chain of reasoning. It's like: "I've seen people peel apples before eating them. Why?" "Oh, that's because apples are fruit, and one of the key fruits is the orange, and you always peel an orange!" – ayorgo – 2019-06-29T16:13:32.480