Feature Selection in Linear Regression



I have an insurance dataset, shown below, and I need to build a model that predicts the charges.

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

But I am not sure whether the column "region" can be dropped. Is there a test I can perform to keep only the significant variables?


Posted 2018-04-30T10:19:41.153


One-hot encode the column. – Aditya – 2018-04-30T13:19:04.747



First, encode your categorical columns, then fit a linear regression. Do you also want to know how much each feature influences your dependent variable?

Read this article: http://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/

It gives good insight into how the coefficients returned by the algorithm can be interpreted, discussed, and tweaked. Read it, then come back here with any further questions; you're more than welcome to.
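As a minimal sketch of "encode, then fit", assuming pandas and NumPy are available: the data below is a synthetic stand-in with the same column names as the question (not the real insurance data), and the formula that generates charges is made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(2),
    "children": rng.integers(0, 4, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(
        ["southwest", "southeast", "northwest", "northeast"], n),
})
# Made-up generating process: region has no effect by construction.
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * (df["smoker"] == "yes")
                 + rng.normal(0, 3000, n))

# One-hot encode the categorical columns; drop_first removes one level
# per variable so the dummies are not collinear with the intercept.
X = pd.get_dummies(df.drop(columns="charges"), drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)
y = df["charges"].to_numpy()

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
coefs = pd.Series(coef, index=X.columns)
print(coefs.round(1))
```

Each dummy coefficient is read relative to the dropped level, so a region coefficient is the estimated difference in charges against the omitted region with the other columns held fixed.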

Felipe Bormann

Posted 2018-04-30T10:19:41.153



Why not consider gradient boosted decision trees (GBDT) for regression? There are several implementations with Python interfaces (XGBoost, LightGBM, and CatBoost).

The good things about GBDTs (more relevant to your problem) are:

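  • They have an intrinsic way to calculate feature importance, based on how much each split improves the tree's splitting criterion (e.g. variance reduction for regression, Gini impurity for classification).
  • They can deal with the categorical variables that you have (sex, smoker, region), natively in some implementations and after encoding in others.
  • They also account for possible interactions among your variables. A simple linear model does not capture interactions unless you add the terms yourself, and strongly correlated predictors can make its coefficients unstable.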
  • There are many ways to regularize GBDTs, which may come very handy!
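To make the feature-importance point concrete, here is a rough sketch. It uses scikit-learn's GradientBoostingRegressor rather than any of the three libraries named above, simply because it is a common default; that estimator needs the categorical columns encoded first. The data is synthetic (same column names as the question, made-up effects), and the hyperparameters are illustrative guesses, not tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the insurance data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(2),
    "children": rng.integers(0, 4, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(
        ["southwest", "southeast", "northwest", "northeast"], n),
})
# Made-up generating process: region has no effect by construction.
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * (df["smoker"] == "yes")
                 + rng.normal(0, 3000, n))

# This estimator does not accept string columns, so encode them first.
X = pd.get_dummies(df.drop(columns="charges"), drop_first=True).astype(float)
y = df["charges"]

# Shallow trees, a small learning rate and row subsampling are common
# ways to regularize a GBDT; the values here are illustrative only.
model = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.05,
    subsample=0.8, random_state=0)
model.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance.round(3))
```

If the region dummies sit near the bottom of this ranking on your real data, that is a hint (not a formal test) that the column contributes little.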

With GBDTs you only have to be careful that a continuous variable, in your case bmi, does not artificially dominate your trees, since a tree can split on it at many more points than on a low-cardinality variable. You can mitigate this by rounding or binning the continuous variable, among other methods.

If you have strong reasons to stick to linear regression, you could use LASSO, a regularized linear regression whose L1 penalty shrinks the coefficients of less important variables to exactly zero. People use LASSO for feature selection as well.
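A small sketch of LASSO used as a selector, assuming scikit-learn. The data is again synthetic with made-up effects, and alpha=500 is an arbitrary value picked for this toy data; on real data you would normally choose it by cross-validation (for example with LassoCV).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the insurance data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(2),
    "children": rng.integers(0, 4, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(
        ["southwest", "southeast", "northwest", "northeast"], n),
})
# Made-up generating process: region has no effect by construction.
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * (df["smoker"] == "yes")
                 + rng.normal(0, 3000, n))

X = pd.get_dummies(df.drop(columns="charges"), drop_first=True).astype(float)
y = df["charges"]

# Standardize so the L1 penalty treats every column on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# alpha controls the penalty strength; 500 is an arbitrary choice here.
lasso = Lasso(alpha=500.0).fit(X_scaled, y)
coefs = pd.Series(lasso.coef_, index=X.columns)

kept = coefs[coefs != 0].index.tolist()
dropped = coefs[coefs == 0].index.tolist()
print("kept:", kept)
print("dropped:", dropped)
```

Note that LASSO zeroes individual dummy columns, so it may keep one region level and drop the others; deciding on region as a whole takes a group-wise method or a judgment call.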


Posted 2018-04-30T10:19:41.153



The package leaps in R has functions, such as regsubsets, that do this. Essentially, these functions compare different models using a criterion that you choose and keep the model that does best according to that criterion.

Please note that keeping the model with the highest $R^2$ is not a very good criterion, as adding variables never lowers $R^2$, so the largest model always wins. The criteria that regsubsets can use include adjusted $R^2$ and the Bayesian Information Criterion (BIC), among others.
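For readers working in Python, the same idea can be sketched by hand with NumPy and itertools. This is not a port of leaps, just an exhaustive search over the six predictors that scores each subset by $R^2$, adjusted $R^2$, and BIC. The data is synthetic with made-up effects, and all dummies of a categorical variable enter or leave the model together.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(2),
    "children": rng.integers(0, 4, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(
        ["southwest", "southeast", "northwest", "northeast"], n),
})
# Made-up generating process: region has no effect by construction.
df["charges"] = (250 * df["age"] + 300 * df["bmi"]
                 + 20000 * (df["smoker"] == "yes")
                 + rng.normal(0, 3000, n))

y = df["charges"].to_numpy()
features = [c for c in df.columns if c != "charges"]
X_all = pd.get_dummies(df[features], drop_first=True).astype(float)
# Map each original variable to its encoded column(s).
groups = {v: [c for c in X_all.columns if c == v or c.startswith(v + "_")]
          for v in features}
tss = float(np.sum((y - y.mean()) ** 2))


def fit_stats(cols):
    """Fit OLS with an intercept; return R^2, adjusted R^2 and BIC."""
    X = np.column_stack([np.ones(n)] + [X_all[c].to_numpy() for c in cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1]  # number of fitted coefficients, intercept included
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    bic = n * np.log(rss / n) + k * np.log(n)  # Gaussian BIC up to a constant
    return r2, adj_r2, bic


rows = []
for size in range(len(features) + 1):
    for subset in combinations(features, size):
        cols = [c for v in subset for c in groups[v]]
        r2, adj_r2, bic = fit_stats(cols)
        rows.append({"variables": subset, "r2": r2,
                     "adj_r2": adj_r2, "bic": bic})

table = pd.DataFrame(rows)
best_r2 = table.loc[table["r2"].idxmax()]
best_bic = table.loc[table["bic"].idxmin()]
print("highest R^2:", best_r2["variables"])
print("lowest BIC: ", best_bic["variables"])
```

The subset with the highest $R^2$ is the full model, which illustrates the warning above, while BIC and adjusted $R^2$ charge for each extra coefficient. With six predictors there are only 64 subsets, so exhaustive search is cheap; it stops being practical as the number of candidates grows.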

David Masip

Posted 2018-04-30T10:19:41.153


Thanks for the answer. I will try to implement it with the given package. However, is there a similar package in Python too? – deepguy – 2018-04-30T10:49:28.487

This question addresses it: https://stackoverflow.com/questions/37624920/subset-regression-in-python-via-exhaustive-search – David Masip – 2018-04-30T10:51:35.393

Apparently the best you can do is call R from Python. Sadly, R has many fancy statistical tools that Python doesn't. – David Masip – 2018-04-30T10:52:03.317

Did I help answer your question? – David Masip – 2018-05-02T12:53:01.627