I'm working on selecting the most effective features from a dataset with over 2000 features. I'm using several algorithms for that (SelectKBest with chi-square, Extra Trees, correlation, etc.), but when I looked at the feature rankings I saw that SelectKBest with chi-square generates exactly the same results as correlation. Is that possible, or am I doing something wrong?

All of my features are continuous 64-bit floats in the range [-8, 11], and my target column is binary (it can only be 0 or 1).

**Updated on 05.09.19: I am STILL trying to work out how this is possible. I can guess that both methods are based on the same formula and were developed by the same person, but I need a proof to understand it clearly.**

Correlation Function:

```
import pandas as pd

# Pairwise correlation matrix of all columns, including the target
cor = data.corr()
# "Class" is my target column; take absolute correlation with it
cor_target = abs(cor["Class"])
# Correlation values for every feature, without the target column itself
relevant_features = cor_target[cor_target > 0].drop(labels=["Class"])
# Names of the top 1000 features
relevant_features = relevant_features.nlargest(1000).index.values
```

SelectKBest function:

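```
from sklearn.feature_selection import SelectKBest, chi2

# k="all" keeps every feature so that all scores are available
bestfeatures = SelectKBest(score_func=chi2, k="all")
fit = bestfeatures.fit(dataValues, dataTargetEncoded)
feat_importances_chi = pd.Series(fit.scores_, index=dataValues.columns).nlargest(1000).index.values
```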

And the results, relevant_features and feat_importances_chi, are exactly the same.
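To make "exactly the same" more precise, it may help to check whether the two selections agree only as sets or also in order. The following is a minimal sketch on synthetic data, not the real dataset: the data, the `f0`…`f49` column names, the cutoff `k = 10`, and the min-max rescaling are all assumptions made for illustration (scikit-learn's `chi2` raises an error on negative inputs, so some rescaling is assumed here).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (assumption): 500 rows, 50 float features.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 50)),
                 columns=[f"f{i}" for i in range(50)])
# Binary target loosely driven by the first three features plus noise.
latent = X["f0"] + 0.5 * X["f1"] - 0.5 * X["f2"] + rng.normal(size=500)
y = (latent > 0).astype(int).rename("Class")

# chi2 rejects negative values, so rescale each feature to [0, 1] (assumption).
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Scores from both methods, indexed by feature name.
corr_scores = X_scaled.corrwith(y).abs()
chi_scores = pd.Series(chi2(X_scaled, y)[0], index=X_scaled.columns)

k = 10
top_corr = corr_scores.nlargest(k).index
top_chi = chi_scores.nlargest(k).index

same_set = set(top_corr) == set(top_chi)        # same features selected?
same_order = list(top_corr) == list(top_chi)    # same ranking as well?
overlap = len(set(top_corr) & set(top_chi)) / k
# Spearman-style agreement: Pearson correlation of the two rank vectors.
rank_agreement = corr_scores.rank().corr(chi_scores.rank())

print(f"same set: {same_set}, same order: {same_order}")
print(f"top-{k} overlap: {overlap:.2f}, rank agreement: {rank_agreement:.3f}")
```

Comparing `same_set` against `same_order` separates "the same 1000 features were picked" from "the features were ranked identically", which are different claims when 1000 out of roughly 2000 features are kept.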

What kind of features do you have? Numeric? Categorical? Binary? – astel – 2019-08-28T14:06:27.457

Thank you for your answer. I updated my question. – justRandomLearner – 2019-08-29T13:35:06.713

I guess the first thing I will suggest is that the chi-squared statistic is intended for categorical variables, not for continuous variables. There is likely some binning done internally by your function, but I can't find any documentation on how. – astel – 2019-08-29T14:51:18.027

Yes, in general there are deficiencies in the documentation, but I asked a question about it; maybe it helps you too: https://stackoverflow.com/questions/57273694/how-selectkbest-chi2-calculates-score – justRandomLearner – 2019-09-02T11:49:43.873
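
On the question of how the score is calculated: as far as I understand scikit-learn's `chi2`, it does not bin continuous values but treats each feature as non-negative counts or frequencies, summing the feature per class and comparing those sums with what independence from the class would predict. The sketch below reproduces that computation by hand on hypothetical random data (the array shapes and variable names are assumptions for illustration) and compares it with the library's scores.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical data: non-negative float features and a binary target.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.integers(0, 2, size=200)

# One-hot encode the target: column 0 marks class 0, column 1 marks class 1.
Y = np.column_stack([1 - y, y])

# Observed: per-class sum of each feature, shape (n_classes, n_features).
observed = Y.T @ X
# Expected under independence: class frequency times the feature's total sum.
feature_total = X.sum(axis=0, keepdims=True)   # shape (1, n_features)
class_prob = Y.mean(axis=0, keepdims=True)     # shape (1, n_classes)
expected = class_prob.T @ feature_total        # shape (n_classes, n_features)

# Chi-squared statistic per feature, summed over the classes.
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, _ = chi2(X, y)

print(np.allclose(manual_scores, sklearn_scores))
```

If the two score vectors agree, the statistic depends only on per-class sums of the raw feature values, which would explain why it can track a linear measure such as correlation with a binary target so closely.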