What can be done to correct for sampling bias introduced by (noisy) training data when training a DNN?


The obvious solution is to ensure that the training data is balanced, but in my particular case that is impossible. What corrections can one perform in such a scenario?

I know that my training data is heavily biased towards one particular class, and I cannot change that. Moreover, the labels are very noisy. Given this, is there anything I can do, by tweaking the training process itself or by some other means, to correct for the bias in the training data?

The data comes from an experiment (from an electron microscope), and I cannot collect more data. It's always going to be biased in this way, so alternatively-biased data is also not an option. I'm sorry that I'm unable to provide any more details due to confidentiality.

Tejal

Posted 2016-08-22T19:28:42.590

Reputation: 61

Is this a programming question? – Mithical – 2016-08-22T19:53:34.690

I'm VTCing as 'Unclear what you're asking'. – Mithical – 2016-08-22T20:08:56.393

No, I'm talking about algorithmic changes – Tejal – 2016-08-22T20:10:11.067

You'll have to add a lot more info into your question. Like, why the solution is impossible in your case. I'm still not sure what your problem is... – Mithical – 2016-08-22T20:12:17.040

Let me rephrase: I know that my training data is heavily biased towards one particular class, and I cannot change that. Given this, is there anything I can do, by tweaking the training process itself or by some other means, to correct for the bias in the training data? Does this make more sense? – Tejal – 2016-08-22T20:16:31.573

Please edit it into your post. – Mithical – 2016-08-22T20:19:07.687

What prevents you from adding additional unbiased (or alternatively-biased) training data? – NietzscheanAI – 2016-08-22T20:20:04.177

The data comes from an experiment (from an electron microscope), and I cannot collect more data. It's always going to be biased in this way, so alternatively-biased data is also not an option. I'm sorry that I'm unable to provide any more details due to confidentiality. – Tejal – 2016-08-22T20:22:06.703

@Tejal Concrete details aren't really necessary; it should be entirely possible to discuss this in the abstract. Why not fit a parametric statistical model (e.g. a mixture of Gaussians) to your data, play around with the model parameters to adjust the bias, then use the model as a generator for the training set? – NietzscheanAI – 2016-08-23T07:10:36.687
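
A minimal sketch of the generator idea in the comment above, under loud assumptions: each class is modelled by a single diagonal Gaussian (a simplification of a full mixture of Gaussians), the "parameter adjusted" is only the class mixing weight, and the toy 2-D data, the function names `fit_class_gaussians` and `sample_rebalanced`, and the 90/10 imbalance are all hypothetical. Note that with very noisy labels the per-class fits are themselves contaminated, so this illustrates the rebalancing step only and is not a fix for label noise.

```python
import random
from statistics import NormalDist


def fit_class_gaussians(features, labels):
    """Fit one diagonal Gaussian per class (one NormalDist per feature dimension)."""
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    models = {}
    for y, rows in by_class.items():
        # zip(*rows) transposes the rows into per-dimension columns.
        models[y] = [NormalDist.from_samples(col) for col in zip(*rows)]
    return models


def sample_rebalanced(models, class_weights, n, seed=0):
    """Draw n synthetic (x, y) pairs using the chosen class mixing weights."""
    rng = random.Random(seed)
    classes = list(class_weights)
    weights = [class_weights[c] for c in classes]
    samples = []
    for _ in range(n):
        # Pick a class according to the adjusted weights, not the observed ones.
        y = rng.choices(classes, weights=weights, k=1)[0]
        x = [rng.gauss(d.mean, d.stdev) for d in models[y]]
        samples.append((x, y))
    return samples


# Hypothetical toy data: 90 majority-class points vs. 10 minority-class points.
rng = random.Random(42)
majority = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(90)]
minority = [[rng.gauss(3.0, 1.0), rng.gauss(3.0, 1.0)] for _ in range(10)]
features = majority + minority
labels = [0] * 90 + [1] * 10

models = fit_class_gaussians(features, labels)
# Override the observed 90/10 split with equal mixing weights.
synthetic = sample_rebalanced(models, {0: 0.5, 1: 0.5}, n=1000, seed=1)
```

The resulting `synthetic` list could then stand in for (or be mixed with) the original training set. Whether a single Gaussian per class is adequate depends entirely on the real data, which is not described here.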

Answers


I feel that, from the information you're giving (some sort of biased data), you can't get an answer as robust as you'd like (what algorithmic changes can be made).

In general, the reason methods like DNNs work is that they learn from the data. What you train a network to do is what it is capable of, and there's little one can do to 'balance' it towards classes of data it never sees. It's like training someone to do algebra and then giving them a trigonometry test. It's all math, sure, but you can't expect much without the proper learning.

That being said, you should perhaps look at other methods to work with this data, or to approach the problem. Given that you cannot collect unbiased data and that you can't explain more due to confidentiality, I really doubt anyone here can help you that much.

I can at most point you to this article: "Classification on Data with Biased Class Distribution".

I'd also suggest that your current approach may not be the most appropriate given the unfortunate circumstances.

Avik Mohan

Posted 2016-08-22T19:28:42.590

Reputation: 676

Thanks! I found a paper that is very relevant to my problem: https://arxiv.org/pdf/1412.6596v3.pdf . I should have mentioned that my labels are very noisy, so stratified sampling is not a solution. – Tejal – 2016-08-22T23:45:18.837

I see. Good to know you found a relevant paper. Good luck in your research! – Avik Mohan – 2016-08-23T13:47:56.020

@Tejal Could you post this paper as an answer, with some details on why it was relevant to your problem and how you solved it? – kenorb – 2017-01-12T12:22:50.350