I am trying to simulate data from a normal distribution but bias the sample by excluding all negative values and values divisible by 5, to demonstrate the effect of bias. I will probably calculate statistics from the original vs. the biased sample. I want to generate the data for sample sizes 100, 100, 1000 and show plots using Python. I hope this clarifies the question. I'm not sure how to approach it.

what do you consider a biased dataset? – oW_ – 2019-02-15T19:01:43.463

Maybe a sample that excludes negative values from a standard normal distribution, or a sample that excludes values divisible by 5? I'm just trying to show that a bigger dataset does get rid of selection bias – DrYoms – 2019-02-15T19:11:09.523

Well, by bias perhaps you mean error on the training dataset rather than the validation dataset, right? That should be the case if this is for school. See here: https://www.dataquest.io/blog/learning-curves-machine-learning/. That said, it is not necessarily true; often bias can be reduced by increasing the sample size, or at least doing so won't make it any worse! – TwinPenguins – 2019-02-15T22:13:36.553

Thank you. I'm trying to show that with selection bias, no matter how big your sample is, it is difficult to get rid of the bias because of the selection criteria – DrYoms – 2019-02-15T23:54:04.980
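The point in the last comment can be checked with a small simulation. What follows is only a rough sketch under several assumptions that are not in the question: the draws are rounded to integers (otherwise "divisible by 5" almost never applies to a continuous value), the distribution is taken as mean 0 and standard deviation 10 so the rounded values have some spread, and the sample sizes 100, 1000, 10000 are a guess at what was intended. The names `simulate`, `summarize` and `plot_samples` are made up for illustration.

```python
import numpy as np

# Assumed sample sizes; adjust to whatever sizes are actually wanted.
SIZES = (100, 1000, 10000)


def simulate(n, mean=0.0, sd=10.0, seed=0):
    """Return (original, biased) integer samples of a rounded normal draw.

    The biased sample drops negative values and multiples of 5
    (0 counts as a multiple of 5, so it is dropped too).
    """
    rng = np.random.default_rng(seed)
    original = np.rint(rng.normal(mean, sd, size=n)).astype(int)
    keep = (original >= 0) & (original % 5 != 0)
    return original, original[keep]


def summarize(sample):
    """Basic statistics for one sample."""
    return {
        "n": int(sample.size),
        "mean": float(sample.mean()),
        "std": float(sample.std(ddof=1)),
    }


def plot_samples(sizes=SIZES):
    """Overlay histograms of the original and biased samples per size."""
    # Imported here so the statistics above run without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, len(sizes), figsize=(4 * len(sizes), 3))
    for ax, n in zip(np.atleast_1d(axes), sizes):
        original, biased = simulate(n)
        # One bin per integer value, centred on the integer.
        bins = np.arange(original.min() - 0.5, original.max() + 1.5)
        ax.hist(original, bins=bins, alpha=0.5, label="original")
        ax.hist(biased, bins=bins, alpha=0.5, label="biased")
        ax.set_title(f"n = {n}")
        ax.legend()
    fig.tight_layout()
    plt.show()


# Compare the statistics of the original and biased samples at each size.
for n in SIZES:
    original, biased = simulate(n)
    print(n, "original:", summarize(original), "biased:", summarize(biased))
```

Calling `plot_samples()` draws the histograms side by side. Under these assumptions the original mean should settle near 0 as n grows, while the biased mean stays well above 0 at every size, since the filter removes the same part of the distribution regardless of how many values are drawn.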