Model biased towards low frequency data?


Generally, a model becomes biased towards the data samples/targets that occur most frequently in the training data set. Is it possible for a model to instead become biased, during training, towards the low-frequency part of the training data?

vipin bansal

Posted 2019-05-24T02:59:13.803

Reputation: 1 322

Could you please elaborate on your question and problem? – Shubham Panchal – 2019-05-24T04:46:03.657

We have a dataset for a binary classifier in which class 1 has a huge amount of data whereas class 0 has very little, i.e. the data is skewed. During model training it is quite possible, and expected, that the model becomes biased towards class 1. I want to know whether it is also possible for the model to become biased towards class 0. – vipin bansal – 2019-05-24T04:48:01.587



With structured data, you generally face four challenges:

(1) Missing data

(2) Outliers

(3) Cardinality

(4) Rare values (as a rule of thumb <5%)

Rare values in categorical variables tend to cause over-fitting, particularly in tree-based methods. Ph.D. Data Scientist Soledad Galli has an amazing course on the subject (Udemy: "Feature Engineering"). Below is a screenshot from her course, but to be fair to her, I'm not going to post the solution.

[Screenshot from the course; image not available]
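To make the <5% rule of thumb for rare values concrete, here is a minimal sketch of how rare labels in a categorical variable could be flagged. It is not taken from the course; the function name `rare_labels`, the `colors` column, and the default threshold of 0.05 are assumptions made up for illustration.

```python
from collections import Counter


def rare_labels(values, threshold=0.05):
    """Return {label: relative frequency} for labels rarer than `threshold`.

    The default of 0.05 mirrors the <5% rule of thumb; it is an
    assumption for illustration, not a fixed rule.
    """
    counts = Counter(values)
    total = len(values)
    return {
        label: count / total
        for label, count in counts.items()
        if count / total < threshold
    }


# Hypothetical categorical column: 100 rows, three labels.
colors = ["red"] * 60 + ["blue"] * 37 + ["green"] * 3

# Only "green" (3 of 100 rows) falls below the 5% threshold.
print(rare_labels(colors))
```

This only detects the rare labels; what to do with them afterwards is a separate modelling decision.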


Posted 2019-05-24T02:59:13.803

Reputation: 901

Thanks for the reference. – vipin bansal – 2019-06-05T12:43:36.933