AUC ROC metric on a Kaggle competition



I am trying to learn data modeling by working on a dataset from a Kaggle competition. Since the competition closed two years ago, I am asking my question here. The competition uses AUC-ROC as the evaluation metric. This is a classification problem with 5 labels, which I am modeling as 5 independent binary classification problems. The data is highly imbalanced across labels; in one case the imbalance is 333:1. While researching how to interpret the AUC-ROC metric, I found this and this. Both articles basically say that AUC-ROC is not a good metric for an imbalanced data set. So why would the competition use this metric to evaluate models? Is it even a reasonable metric in such a context? If yes, why?

Sahil Gupta

Posted 2020-02-28T00:37:20.400

Reputation: 65



As you would have seen in the research, AUC ROC prioritizes getting the order of the predictions correct, rather than approximating the true frequencies.

Usually, as in the credit card fraud problem you link to, the impact of one or two false negatives is more devastating than that of many false positives. If the classes are also imbalanced, as they are in the fraud case, AUC ROC is a bad idea.

It appears that in the competition you are referring to, the hosts are more interested in labeling which comments are more toxic than others rather than rating how toxic they each are. This makes sense since in reality the labels are subjective.
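
To illustrate the point about ordering versus true frequencies, here is a minimal sketch in plain Python. The labels and predicted probabilities are made up for illustration, and `roc_auc` and `log_loss` are small hand-written helpers, not any library's API. Cubing the probabilities keeps their order but distorts their values: the AUC should stay the same, while a calibration-sensitive score such as log loss changes.

```python
import math

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(labels, scores):
    """Mean negative log-likelihood; sensitive to how well calibrated the scores are."""
    total = sum(-math.log(s) if y == 1 else -math.log(1.0 - s)
                for y, s in zip(labels, scores))
    return total / len(labels)

# Hypothetical toy data: 3 positives, 5 negatives.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]

# A monotonic transform: same ranking, very different "probabilities".
squashed = [p ** 3 for p in probs]

auc_original = roc_auc(labels, probs)
auc_squashed = roc_auc(labels, squashed)
loss_original = log_loss(labels, probs)
loss_squashed = log_loss(labels, squashed)

print(f"AUC:      {auc_original:.4f} vs {auc_squashed:.4f}")
print(f"Log loss: {loss_original:.4f} vs {loss_squashed:.4f}")
```

Because AUC only looks at pairwise order, it cannot tell these two sets of predictions apart, which is why it suits a task where only the ranking matters.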


Posted 2020-02-28T00:37:20.400

Reputation: 236

Your answer makes a lot of sense. I guess you are pointing at the business intuition behind the technical problem at hand to choose an appropriate metric. I am still confused about why AUC ROC is a bad idea for the credit card problem. For instance, we are interested in minimizing the number of false negatives. ROC is the plot of TPR vs. FPR. The ideal point (1,0) signifies FN = 0 at the cost of high FP. Then why is AUC a bad idea? Also, can you elaborate on what you mean by "more interested in labeling which comments are more toxic than others rather than rating how toxic they each are"? – Sahil Gupta – 2020-02-28T03:27:12.923

@SahilGupta, I think the blog post you linked to gives a clear answer to your two questions: 1. AUC ROC is a bad idea for that problem since "false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives", and 2. "You should use [AUC ROC] when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities". Perhaps your question is really: why are metrics like F1 and accuracy better than AUC ROC for imbalanced data sets? – nigelhenry – 2020-02-28T06:18:49.743
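
A quick worked example of the "pulled down" effect quoted above, as a sketch in plain Python. The confusion-matrix counts are hypothetical and chosen only to match the 333:1 imbalance mentioned in the question; `rates` is an illustrative helper, not a library function.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # recall: share of positives that are caught
    fpr = fp / (fp + tn)        # share of negatives that are wrongly flagged
    precision = tp / (tp + fp)  # share of flagged items that are truly positive
    return tpr, fpr, precision

# Hypothetical data set with a 333:1 imbalance.
positives, negatives = 100, 33_300

# Hypothetical classifier at one threshold: catches 80 positives,
# wrongly flags 1% of the negatives.
tp, fn = 80, positives - 80
fp = 333
tn = negatives - fp

tpr, fpr, precision = rates(tp, fp, fn, tn)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}, precision = {precision:.3f}")
```

On the ROC plane this point (FPR 0.01, TPR 0.80) looks strong, yet only about one in five flagged items is a true positive. The large pool of true negatives keeps the FPR small even when false positives outnumber true positives about four to one.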

Thanks for the discussion! This was really informative for me. I went back and did some more reading into AUC. I was confusing the fact that ROC curves are insensitive to class distribution with AUC being good for imbalanced problems. Tom Fawcett's paper on ROC was super helpful in improving my understanding of AUC ROC. Also, the idea behind ranking predictions vs classifying them makes much more sense, thanks to this answer I found during my research:

– Sahil Gupta – 2020-02-29T21:33:36.017

Contd. Frank Harrell's blog posts were also helpful in understanding the underlying concepts. I was missing the idea of scoring rules in this whole discussion. That also opened up my mind to the fact that AUC ROC is a scoring rule (driving our decisions), once we have some predictions from our model, rather than directly labeling them (like you mentioned labels are subjective). – Sahil Gupta – 2020-02-29T21:38:09.120