How to know if there is bias in data collection methods


I am collecting data for machine learning models I want to build for some application.

I started with random sampling (simply collecting 'recent' data), but I am not getting enough records of the kind I want to use as my target. The process is currently automated.

I don't want to use SMOTE unless it works really well with text (NLP) data.

I also don't really want to tackle this as an anomaly detection problem.

Ideally, I want to build a look-alike model that gives an "authenticity score".

I could manually search for the targets I am lacking, or set up another collection method that may be more efficient than random sampling.

What I am afraid of is that this might introduce bias into my data.

How can I know if there is a bias?

My data consists of text, categorical, and continuous variables.

If the samples from both methods are mixed and shuffled well, would that be all right, as long as the training data does not come from one sampling method and the test data from another?
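To make the question concrete, here is a minimal sketch of one possible check for a single continuous feature: compare the randomly collected values with the targeted ones using a two-sample Kolmogorov-Smirnov statistic and a permutation test. Everything in it is an assumption for illustration: the function names `ks_statistic` and `permutation_pvalue` are hypothetical, and the data is synthetic, not my real data. A small p-value would suggest that the two collection methods produce different distributions for that feature.

```python
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def permutation_pvalue(a, b, n_perm=500, seed=0):
    """Share of random relabelings with a KS gap at least as large as observed."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of 0.
    return (hits + 1) / (n_perm + 1)


# Synthetic stand-ins for one continuous feature (illustration only).
rng = random.Random(42)
random_sample = [rng.gauss(0.0, 1.0) for _ in range(200)]     # method A
targeted_same = [rng.gauss(0.0, 1.0) for _ in range(200)]     # method B, no shift
targeted_shifted = [rng.gauss(1.0, 1.0) for _ in range(200)]  # method B, shifted

p_same = permutation_pvalue(random_sample, targeted_same)
p_shifted = permutation_pvalue(random_sample, targeted_shifted)
print("p-value, same distribution:", p_same)
print("p-value, shifted distribution:", p_shifted)
```

This only looks at one feature at a time, so it would miss differences that show up only in combinations of features, and it does not cover the text fields at all.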

Hiro Nakagame

Posted 2020-12-31T18:42:16.277

Reputation: 35

Is the issue that you’re winding up with unbalanced classes in a classification problem? – Dave – 2020-12-31T20:53:26.763

Yes, that is my main issue, because currently less than 1% of the collected samples are my target – Hiro Nakagame – 2020-12-31T20:54:41.507

Look up Stephan Kolassa on Cross Validated. He has a post on why unbalanced classes aren’t a problem at all! – Dave – 2020-12-31T21:07:43.237

Artificially oversampling the minority class (if this is the case) IS going to introduce bias. In simple terms, unbiased sampling means representative sampling, in the sense that it reflects the underlying distribution. Since this distribution is unknown, random sampling is the best approach in general, as long as it is indeed diversified and random – Nikos M. – 2021-01-01T17:36:27.430

Theoretically, I understand that it would introduce bias. If I call the random samples A and the artificially targeted samples B, I expect the target variable to have a different distribution in each. But if A and B have very similar distributions for the other continuous/categorical features, would there be any issues or concerns because of the sampling methods? – Hiro Nakagame – 2021-01-02T06:11:03.743
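As a sketch of what the changed target distribution would mean in practice: if the targeted collection only changes how often the target appears, and not what targets and non-targets look like (a strong assumption, often called label shift), then a model trained on the mixed data overestimates the target probability, and its output can be rescaled to the real-world target rate. The function name `correct_for_prior_shift` is hypothetical, the 1% rate is the rough figure mentioned in the comments above, and the 50% training rate is made up for illustration.

```python
def correct_for_prior_shift(p_train, train_prior, true_prior):
    """Rescale a predicted target probability from the training-set target
    rate to the real-world target rate (binary case, label-shift assumption)."""
    pos = p_train * true_prior / train_prior
    neg = (1.0 - p_train) * (1.0 - true_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


# Hypothetical numbers: targets make up 50% of the training data after
# targeted collection, but only about 1% of what random sampling returns.
for score in (0.5, 0.9, 0.99):
    adjusted = correct_for_prior_shift(score, train_prior=0.5, true_prior=0.01)
    print(f"model score {score:.2f} -> adjusted {adjusted:.4f}")
```

If the targeted samples also differ from randomly found targets in their features, this correction is not enough, which is the part that a comparison of A and B on the other features would have to rule out.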

No answers