I have some doubts about an analysis. I have a dataset with class imbalance, and I am trying to extract some descriptive information from it, e.g., how many URLs use the http or https protocol. My results are as follows:
http in dataset with class 1: 10
http in dataset with class 0: 109
https in dataset with class 1: 180
https in dataset with class 0: 1560
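Because the two classes differ a lot in size, raw counts are hard to compare directly. A minimal sketch of normalizing the counts within each class is given here; the counts are the ones listed above, while the names `counts`, `https_share`, and `shares` are only illustrative.

```python
# Counts copied from the results above: counts[class_label][protocol]
counts = {
    1: {"http": 10, "https": 180},
    0: {"http": 109, "https": 1560},
}


def https_share(protocol_counts):
    """Return the fraction of URLs in one class that use https."""
    total = protocol_counts["http"] + protocol_counts["https"]
    return protocol_counts["https"] / total


# Share of https URLs inside each class, independent of class size
shares = {label: https_share(c) for label, c in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {share:.1%} https")
```

With these counts the https share is about 94.7% in class 1 and about 93.5% in class 0, so https dominates both classes rather than class 0 alone.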
I am trying to build a classifier based on some features, and the presence of each protocol was supposed to be one of them. However, on the basis of the above results, what do you think I should conclude? Does it make sense to say that most websites in class 0 use the https protocol, even though the dataset has class imbalance? For the model, I would consider resampling techniques. Should I do this analysis (and draw this conclusion) after the resampling, or would it make more sense to check feature importance with other tests (e.g., Pearson correlation, if it is appropriate in this case)?
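On the Pearson correlation idea: for two binary variables (class and protocol), Pearson correlation reduces to the phi coefficient, which can be computed directly from the 2x2 table of counts. The following is only a sketch under that assumption; the function name `phi_coefficient` and the cell ordering are my own choices for illustration.

```python
import math


def phi_coefficient(a, b, c, d):
    """Pearson correlation between two binary variables from a 2x2 table.

    Cell layout assumed here:
      a: class 1 and http,  b: class 1 and https,
      c: class 0 and http,  d: class 0 and https.
    """
    numerator = a * d - b * c
    # Product of the two row totals and the two column totals
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Counts taken from the results listed in the question
phi = phi_coefficient(10, 180, 109, 1560)
print(f"phi = {phi:.4f}")
```

With these counts the coefficient is slightly negative and very small in magnitude, which would point to a weak linear association between protocol and class. The sign only reflects the cell ordering chosen above.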
Any suggestions would be greatly appreciated.