## Is this a good practice of feature engineering?


I have a practical question about feature engineering... say I want to predict house prices using logistic regression, with a bunch of features including the ZIP code. By checking feature importance, I realize ZIP is a pretty good feature, so I decide to add some more features based on it - for example, I go to the Census Bureau and get the average income, population, number of schools, and number of hospitals for each ZIP. With these four new features, I find the model performs better. So I add even more ZIP-related features... and this cycle goes on and on. Eventually the model will be dominated by these ZIP-related features, right?

My questions:

1. Does it make sense doing these in the first place?
2. If yes, how do I know when is a good time to stop this cycle?
3. If not, why not?

6

If you can keep adding new data (based on one main concept, such as area, i.e. the ZIP code) and the performance of your model improves, then it is of course allowed... assuming you only care about the final result.

There are metrics that try to guide you here, such as the Akaike Information Criterion (AIC) or the comparable Bayesian Information Criterion (BIC). These essentially help you pick a model based on its performance while penalising every additional parameter that is introduced and must be estimated. The AIC looks like this:

$${\displaystyle \mathrm {AIC} =2k-2\ln({\hat {L}})}$$

where $k$ is the number of parameters to be estimated - i.e. the number of features you use, since each one gets a coefficient in your logistic regression - and $\hat{L}$ is the maximized value of the model's likelihood function. The BIC replaces the $2k$ term with $k\ln(n)$, where $n$ is the number of observations, so it punishes extra parameters more heavily on larger datasets.

These criteria can help tell you when to stop, as you can try models with more and more parameters, and simply take the model which has the best AIC or BIC value.
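As a concrete sketch of that procedure (the dataset and feature counts below are invented for illustration), you can fit nested logistic regressions with scikit-learn and compute the AIC by hand from the predicted probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def aic(model, X, y):
    """AIC = 2k - 2*ln(L_hat) for a fitted binary logistic regression."""
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = X.shape[1] + 1  # one coefficient per feature, plus the intercept
    return 2 * k - 2 * log_lik

# Toy stand-in for "a model with more and more ZIP-derived features".
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

scores = {}
for n_feats in (2, 4, 6, 8, 10):
    # C is set very large to approximate the unpenalised maximum-likelihood fit
    m = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, :n_feats], y)
    scores[n_feats] = aic(m, X[:, :n_feats], y)

best = min(scores, key=scores.get)  # keep the model with the lowest AIC
```

Lower AIC is better; the moment adding features stops lowering it, that is the criterion's signal to stop.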

If you still have other features in the model that are not related to the ZIP, they could potentially be overwhelmed - that depends on the model you use. However, they may also explain things about the dataset which simply cannot be contained in the ZIP information, such as a house's floor area (assuming this is relatively independent of ZIP code).

In this case you might compare this to something like Principal Component Analysis, where one collection of features explains one dimension of the variance in the data set, while other features explain another dimension. So no matter how many ZIP-related features you have, you may never explain the importance of floor area.
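A small synthetic sketch of that idea (all numbers invented): three near-duplicate ZIP-derived features collapse onto a single principal component, while an independent floor-area feature keeps a component of its own:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
wealth = rng.normal(size=n)                 # latent "ZIP quality" factor
income = wealth + 0.1 * rng.normal(size=n)  # three features that all proxy it
schools = wealth + 0.1 * rng.normal(size=n)
hospitals = wealth + 0.1 * rng.normal(size=n)
floor_area = rng.normal(size=n)             # independent of the ZIP factor

X = StandardScaler().fit_transform(
    np.column_stack([income, schools, hospitals, floor_area]))
ratios = PCA().fit(X).explained_variance_ratio_
# The first component soaks up the three correlated ZIP proxies (~75% of the
# variance); the second is essentially floor area on its own (~25%).
```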

7

1) Yes, it makes sense. Manually creating features helps the learners (i.e. models) grasp more information from the raw data, because the raw data is not always in a form that is amenable to learning, but you can almost always construct features from it that are. The features you are adding are all based on one feature. This is common. However, your learner, logistic regression, is sensitive to multicollinearity. You need to be careful about which features, and how many, you add; otherwise your model may overfit.

2) Since you are using logistic regression, you can always use AIC, or perform a statistical significance test such as the chi-square test (testing goodness of fit), before adding new structure, to decide whether the distribution of the response really differs with and without that structure. This is particularly useful when your data is scarce. Another way is to add a penalty term to your model - for example, lasso (L1-penalized) logistic regression.

3) Adding ever more features is not always a good idea. Beware of the curse of dimensionality: each new feature adds a new dimension to your data. Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class, but in fact their benefits may be outweighed by the curse of dimensionality. I hope section 6 of "A Few Useful Things to Know about Machine Learning" helps.

Is @user3768495 evaluating the performance of the model out-of-sample using e.g. cross-validation? If so, multi-collinearity should not be a problem and he should not worry about overfitting as he will get an indication of overfitting through the validation performance decreasing. – rinspy – 2018-06-14T09:47:43.230

@rinspy overfitting has many faces. Involving a validation set can help avoid overfitting but cannot eliminate the problem. For example, there may be an inconsistent distribution between the training data (which is split into training and validation sets) and the real population. Even if the model performs well on the training data, it may not generalize to the real-world situation. The reference in my answer also discusses overfitting. – Fansly – 2018-06-14T15:36:55.287

True, but avoiding multicollinearity will not help with 'overfitting' arising from covariate shifts. I am just saying that multicollinearity is likely not an issue if he is interested in building a predictive (and not a descriptive) model. – rinspy – 2018-06-15T08:26:24.983

My concept of overfitting is a model failing to generalize to a new dataset, not within the training data. Please see this

– Fansly – 2018-06-15T15:10:28.537

4

Usually, the richer the features the better.

One thing to keep in mind, however: regressions in general do not work well with highly correlated data (multicollinearity). When you expand your features this way, it is something you will want to watch for.

There is a lot of information on this very topic (and potential ways to mitigate), just google regression and multicollinearity.
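One standard check is the variance inflation factor (VIF); values above roughly 5-10 flag problematic collinearity. A hand-rolled sketch (the data below is synthetic):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all the others (plus an intercept)."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
z = rng.normal(size=500)                              # shared ZIP-like factor
X = np.column_stack([z + 0.1 * rng.normal(size=500),  # two collinear columns
                     z + 0.1 * rng.normal(size=500),
                     rng.normal(size=500)])           # one independent column
vifs = vif(X)  # first two are heavily inflated, the third sits near 1
```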

In short,

1. Yes. Most definitely.
2. @n1k31t4 has some good suggestions. Feel free to generate whatever features you think will improve your model; you can then use techniques such as PCA and other feature-selection methods to limit yourself to what's important.
3. The other thing to consider is practicality: effort vs. result.
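Besides PCA, a quick way to keep only what matters is univariate feature selection; here is a sketch with scikit-learn's SelectKBest (the dataset, k, and scoring function are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 features, many of them redundant combinations of a few informative ones.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# Keep the 8 features with the strongest univariate F-score against y.
selector = SelectKBest(f_classif, k=8).fit(X, y)
X_small = selector.transform(X)  # shape (500, 8)
```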

0

Features are the information your model works with. The more (relevant) information there is, the better it will be able to perform and predict; the less of it, the harder prediction becomes. So the short answer is yes: it is generally worth having as many informative features as possible. There is a limit, though, since too many features also inflate the computational cost, so be careful about how many features you engineer. Also, unnecessary features only add noise, so it is good practice to prune them - the data-preprocessing phase is largely about that.

The first answer has some good details about this. As for stopping the cycle, there are several measures you can watch to see when your model has stopped improving, such as the RMSE. A simple example is running a boosted-tree regression (e.g. xgboost) on your data with a specified number of rounds: the RMSE falls with each round until it plateaus, after which you can deduce that more rounds no longer help. This is how model tuning and optimisation works.
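A sketch of that plateau effect, using scikit-learn's gradient boosting as a stand-in for xgboost (all numbers here are arbitrary): held-out RMSE is computed after every boosting round via `staged_predict`, and it stops improving at some point:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Held-out RMSE after each boosting round.
rmse = [np.sqrt(mean_squared_error(y_te, pred))
        for pred in gbm.staged_predict(X_te)]
# rmse falls steeply at first, then flattens out - the plateau is the
# natural point to stop adding rounds (or, by analogy, features).
```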