## Feature selection in an imbalanced dataset


I have some doubts regarding an analysis. I have a dataset with class imbalance, and I am trying to extract some information from the data, e.g., how many URLs use the http or https protocol. My results are as follows:

| Protocol | Class 1 | Class 0 |
|----------|---------|---------|
| http     | 10      | 109     |
| https    | 180     | 1560    |


I am trying to build a classifier based on some features, and the presence of a protocol was supposed to be one of them. However, on the basis of the above results, what should I conclude? Does it make sense to say that most websites with class 0 have an https protocol, even though my dataset is imbalanced? For a model, I would consider resampling techniques. Should I do this analysis (and draw this conclusion) after resampling, or would it make sense to check feature importance with other tests (e.g., Pearson correlation, if it is appropriate in this case)?

Any suggestion would be greatly appreciated.


What this shows is that the protocol is not a very discriminative feature:

- the probability of class 1 given http is 10/(109+10) ≈ 0.084
- the probability of class 1 given https is 180/(180+1560) ≈ 0.103

If these conditional probabilities were very different, this feature would be more helpful for predicting the class, but they differ only slightly. Note that the feature might still be useful, but it doesn't have a big impact on its own. If you want to know whether the difference is significant (i.e. not due to chance), you could run a chi-square test.
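As a minimal sketch of that chi-square test, assuming the counts from the question form a 2×2 contingency table (protocol × class), `scipy.stats.chi2_contingency` does the whole computation:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table from the question:
# rows = protocol (http, https), columns = class (1, 0)
table = [[10, 109],
         [180, 1560]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```

With these counts the p-value is well above the usual 0.05 threshold, so the small difference between the two conditional probabilities is consistent with chance.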

> Does it make sense to say that most websites with class 0 have an https protocol, even though my dataset is imbalanced?

It is factually correct, but most websites with class 1 also have https, so it's not very useful information (and on its own it might be confusing for some readers).

> For a model, I would consider resampling techniques. Should I do this analysis after resampling, or would it make sense to check feature importance with other tests (e.g., Pearson correlation, if it is appropriate in this case)?

Feature selection can be done either before or after resampling; it doesn't matter. The two are independent of each other, because the level of association between a feature and the class does not depend on the class proportions.

I don't think Pearson correlation is suitable for categorical variables. I think conditional entropy would be more appropriate here (not 100% sure, there might be other options).
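To illustrate the conditional-entropy idea on the counts from the question, the following sketch computes H(class), H(class | protocol), and their difference (the mutual information). The `counts` dictionary is just the table above rearranged; the tiny mutual information matches the conclusion that the protocol is weakly informative on its own:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, ignoring zero probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Counts from the question: {protocol: (class 1 count, class 0 count)}
counts = {"http": (10, 109), "https": (180, 1560)}
total = sum(c1 + c0 for c1, c0 in counts.values())

# Marginal entropy of the class
n1 = sum(c1 for c1, _ in counts.values())
h_class = entropy([n1 / total, 1 - n1 / total])

# Conditional entropy H(class | protocol):
# weighted average of the per-protocol class entropies
h_cond = sum(
    (c1 + c0) / total * entropy([c1 / (c1 + c0), c0 / (c1 + c0)])
    for c1, c0 in counts.values()
)

print(f"H(class)            = {h_class:.4f} bits")
print(f"H(class | protocol) = {h_cond:.4f} bits")
print(f"mutual information  = {h_class - h_cond:.4f} bits")
```

Knowing the protocol barely reduces the uncertainty about the class here, which is the same message as the near-equal conditional probabilities above.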

Thank you so much, Erwan, for your answer. It has really helped me better understand whether or not to consider discriminative features. Much appreciated. There is a typo in the second probability (it should be 180 in the numerator :) – Val – 2021-02-08T01:02:00.197

@Val: happy to help, thanks for noticing the typo. – Erwan – 2021-02-08T09:55:31.210