Should I prevent augmented data from leaking into the test/cross-validation sets?

6

I have been working with the cats vs dogs dataset from Kaggle, which consists of 25000 images of cats and dogs labelled accordingly (btw, great dataset, totally recommended!)

One of the things I did was to augment the data by simply flipping the images 180 degrees, so an image like this [image] will become this [image]

So far so good, now I have 50000 images, but the question is: should I make SURE that an image and its augmented copy both stay in the training set? I mean, with the previous two images, what would happen if I have one of them in the training set and the other one in the test (or cross-validation) set? Does it mean that I am leaking training data into test data?

I understand that technically they are DIFFERENT images, but my intuition still resists the idea.

Am I correct to assume that augmented images should not leak to the test set?

Juan Antonio Gomez Moriano

Posted 2018-01-19T04:05:24.083

Reputation: 1 071

Answers

4

Yes!

The accuracy of your algorithm should be tested on images that you expect to receive. This would not include data that you have artificially rotated, shifted, blurred, etc. However, augmenting your training data in such ways can and usually will lead to better results.

Split your data first. Then augment only the training data while staying away from the validation data. Only use your testing data when you are confident that your solution leads to good results on your validation set without overfitting it.
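As a rough illustration of "split first, then augment", here is a minimal Python sketch. It works on image IDs rather than pixels, and the function name `split_then_augment`, the 60/20/20 proportions, and the `"flip"` tag are all assumptions made for the example, not something taken from the question. The idea is that the flip would be applied when a training image is loaded, so a flipped copy can never land in the validation or test set.

```python
import random


def split_then_augment(image_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Split the ORIGINAL image IDs first, then add augmented copies to train only.

    Returns three lists of (image_id, transform_name) pairs.
    The fractions and the "flip" transform name are illustrative assumptions.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle of the originals

    n_test = int(len(ids) * test_frac)
    n_val = int(len(ids) * val_frac)

    test_ids = ids[:n_test]
    val_ids = ids[n_test:n_test + n_val]
    train_ids = ids[n_test + n_val:]

    # Only training images get an augmented twin.
    train = [(i, t) for i in train_ids for t in ("original", "flip")]
    # Validation and test keep the untouched originals.
    val = [(i, "original") for i in val_ids]
    test = [(i, "original") for i in test_ids]
    return train, val, test


train, val, test = split_then_augment(range(25000))

# Sanity check: no original image (and hence no flipped copy of it)
# appears on both sides of the split.
train_ids = {i for i, _ in train}
heldout_ids = {i for i, _ in val + test}
assert train_ids.isdisjoint(heldout_ids)
```

If the 50000 images have already been written to disk, the same effect could be obtained by splitting on the ID of the original image, so that each original and its flipped copy always travel together.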

JahKnows

Posted 2018-01-19T04:05:24.083

Reputation: 7 863