I am collecting data for machine learning models I want to build for some application.
I started with random sampling (simply collecting 'recent' data), but I am not getting enough of the records of interest, which I want to use as my target. The collection process is currently automated.
I don't want to use SMOTE unless it works really well with text (NLP) data.
I also don't really want to tackle this as an anomaly detection problem.
Ideally, I want to build a look-alike model that gives an "authenticity score".
I could manually search for the targets I am lacking, or set up another collection method, which may be more efficient than random sampling.
What I am afraid of is that this might introduce bias into my data.
How can I know if there is a bias?
My data consists of text, categorical, and continuous variables.
If all the samples are mixed and shuffled well, would that be all right, as long as the training data does not come from one sampling method and the testing data from another?
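
To make the last question concrete, here is a minimal sketch of the kind of mixing I have in mind. It assumes each record is tagged with the sampling method it came from; the `source` field, the `stratified_split` helper, and the toy records are all hypothetical and only there for illustration. The split is stratified on that tag, so both the training and testing sets contain both sampling methods in the same proportion.

```python
import random
from collections import Counter, defaultdict

def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Shuffle records and split them so that every value of `key`
    appears in train and test in (roughly) the same proportion."""
    rng = random.Random(seed)

    # Group records by the value of `key` (here: the sampling method).
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])

    # Shuffle again so records are not ordered by sampling method.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Hypothetical data: 80 records from random sampling, 20 from targeted search.
records = (
    [{"id": i, "source": "random"} for i in range(80)]
    + [{"id": 80 + i, "source": "targeted"} for i in range(20)]
)

train, test = stratified_split(records, key="source")
train_counts = Counter(r["source"] for r in train)
test_counts = Counter(r["source"] for r in test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Keeping the `source` tag on every record would also let me compare model performance per sampling method later on, which is one way I could look for the bias I am worried about.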