Blind feature engineering

I received a dataset for analysis that had ~100 numeric columns with anonymized column names ($X1$, $X2$, $X3$, etc.) and was asked to do binary classification. My resulting classifier, an SVM, had good accuracy (> 95%), but I was unable to do much in the way of feature engineering or feature generation beyond the standard scaling, null-value replacement, etc., since I had no intuition about the columns.

Is there any standard logic for doing some sort of automated feature generation, i.e. simple mathematical combinations of various columns to create new, useful features? Does this sort of thing have any mathematical basis for linear or tree-based models? Or is feature engineering only really meaningful when one has intuition based on the column names?

shwan

Posted 2019-07-30T03:15:46.577

Reputation: 153

You can try analyzing the feature importance of the columns and then use statistics to decipher the meaning of the underlying columns – Aditya – 2019-07-30T05:35:12.227

Maybe relevant to your question: http://www.orges-leka.de/automatic_feature_engineering.html. The method is based on Bourgain embedding.

– None – 2019-08-31T15:42:06.497

95% accuracy sounds good, but what's your class size distribution? – Itamar Mushkin – 2020-06-30T14:23:18.570

Answers


I think your question has multiple answers. Let's start with a tool built for this purpose, named featuretools. You can read more about it in its excellent documentation, but to give you an intuition: think about a dataset at the transaction level, that is, one row per transaction, where you want to make predictions at the customer level. You would create a series of features based on the transactions, for example:

  • mean, max, min, sum, std of transactions values
  • number of transactions
  • number of unique products bought
  • ...

This tool will automate all these tasks based on entity sets and relationships between your datasets.
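To make the idea concrete, here is a minimal sketch of those customer-level aggregations in plain pandas, using a toy transactions table with hypothetical column names (`customer_id`, `product_id`, `amount`); featuretools automates exactly this kind of rollup:

```python
import pandas as pd

# Hypothetical transaction-level data: several rows per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "product_id": ["a", "b", "a", "c", "c"],
    "amount": [10.0, 25.0, 5.0, 40.0, 40.0],
})

# Roll transactions up to customer level with the aggregations listed above.
customer_features = transactions.groupby("customer_id").agg(
    amount_mean=("amount", "mean"),
    amount_max=("amount", "max"),
    amount_min=("amount", "min"),
    amount_sum=("amount", "sum"),
    amount_std=("amount", "std"),
    n_transactions=("amount", "count"),
    n_unique_products=("product_id", "nunique"),
).reset_index()

print(customer_features)
```

Each row of `customer_features` is now one customer, ready to join against customer-level labels.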

Now, for your situation, you might use PolynomialFeatures from scikit-learn to generate interaction and polynomial features; it is more automated than the other approaches.

Having said that, given your score, in a real application using these approaches, or trying to improve performance by adding more features, may be overwhelming and not that useful. We use Kaggle as our standard for training, and I think it is really good, but business requires explanation as well, and stakeholders will be happy with a less predictive algorithm if they can trust it.

Victor Oliveira

Posted 2019-07-30T03:15:46.577

Reputation: 725

Explanation is hard with anonymous columns. – Itamar Mushkin – 2020-06-30T14:24:10.517

What do you mean by that? @ItamarMushkin – Victor Oliveira – 2020-06-30T14:31:50.713


Another related problem, which I think is worth considering before generating more features, is determining which columns are important for the classification task, i.e. which ones improve prediction of the target variable.

One common way is to rely on feature importance scores, but a disadvantage is that the scores are only available after training a model. To guess which features improve prediction of the target variable before training, we can compute the correlation between each feature and the target. Note that correlation only works for numeric features. To measure the relationship between a categorical feature and a numeric target, we need other measures, called measures of association. You can find out more in this nice article.
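As a small sketch of the pre-training check (synthetic data, with hypothetical column names `X1` and `X2` where only `X1` actually drives the target), the per-feature correlation with a binary target can be ranked like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic anonymous columns: X1 drives the target, X2 is pure noise.
df = pd.DataFrame({
    "X1": rng.normal(size=n),
    "X2": rng.normal(size=n),
})
df["target"] = (df["X1"] + 0.1 * rng.normal(size=n) > 0).astype(int)

# Correlation of each numeric feature with the binary target,
# ranked by absolute value (point-biserial correlation).
corr = (
    df.drop(columns="target")
    .corrwith(df["target"])
    .abs()
    .sort_values(ascending=False)
)
print(corr)
```

Here `X1` comes out on top, so it would be the first column to investigate or to use for generating new features.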

Victor Luu

Posted 2019-07-30T03:15:46.577

Reputation: 223