How to correct mislabeled data in a dataset?



I have a dataset of about 300k records. The classes are highly imbalanced (one class may have 30k records, while another may have only 100). Unfortunately, about 5% of the records are incorrectly labeled.

Is there any way of finding out which elements are wrong, so I would be able to discard them?


Posted 2019-09-18T10:44:11.480

Reputation: 21



Yes! This could be an excellent test case for your classification algorithm. With only 5% mislabeling, a good algorithm should be able to flag the mislabeled records as "outliers", because its predictions for them will disagree with their given labels far more often than for correctly labeled records.
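A minimal sketch of that idea, assuming numeric features and using a toy nearest-centroid classifier as a stand-in for whatever model you actually use: each record is predicted by a model trained on the other folds, and records whose prediction disagrees with the given label are flagged for review. The function names, the fold count, and the synthetic data are all illustrative assumptions, not part of your setup.

```python
import random
from collections import defaultdict


def fit_centroids(X, y):
    """Return {label: mean feature vector} for a nearest-centroid classifier."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(row, centroids[label])),
    )


def flag_suspect_labels(X, y, n_folds=5, seed=0):
    """Return sorted indices whose label disagrees with an out-of-fold prediction."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    suspects = []
    for k in range(n_folds):
        held_out = order[k::n_folds]          # every n_folds-th shuffled index
        held_out_set = set(held_out)
        train = [i for i in order if i not in held_out_set]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        suspects.extend(i for i in held_out if predict(centroids, X[i]) != y[i])
    return sorted(suspects)


# Synthetic, imbalanced toy data (an assumption for illustration only):
# 200 records of class 0 near (0, 0) and 40 records of class 1 near (10, 10).
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
X += [[rng.uniform(9, 11), rng.uniform(9, 11)] for _ in range(40)]
y_true = [0] * 200 + [1] * 40

# Deliberately corrupt 5% of the labels so we know what should be found.
flipped = list(range(10)) + [200, 201]
y_noisy = list(y_true)
for i in flipped:
    y_noisy[i] = 1 - y_noisy[i]

suspects = flag_suspect_labels(X, y_noisy)
print(suspects)
```

On real data the classes will not be this cleanly separated, so treat the flagged records as candidates to inspect rather than as confirmed errors. That applies especially to the classes with only around 100 records, where a model trained on imbalanced data is more likely to disagree with correct labels.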

If you could identify a subset of records that are known to be correct and use it as the training set, that would be even better, but with only 5% mislabeling it should not be a problem if you cannot. This leads to the second part of the answer: while it might be better to remove or correct the mislabeled records, it might also turn out not to matter.

All of this assumes that the 5% of errors are spread more or less randomly across all classes.

Finally, you did not mention any hints, data, or other information that could identify the mislabeled records. If you do have information about those errors, pre-processing the data to identify and remove them, based on analysis or rule generation, would be the best option.


Posted 2019-09-18T10:44:11.480

Reputation: 1 300