I have been working with the cats vs. dogs dataset from Kaggle, which consists of 25,000 images of cats and dogs labelled accordingly (btw, great dataset, totally recommended!).
So far so good: I now have 50,000 images. But the question is: should I make SURE that an image and its augmented copy both stay entirely within the training set? I mean, with the previous two images, what would happen if I had one of them in the training set and the other in the test (or cross-validation) set? Does that mean I am leaking training data into the test data?
I understand that technically they are DIFFERENT images, but my intuition still resists the idea.
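To make the concern concrete, here is a minimal sketch of the kind of split I have in mind, where the split is done on the original images first and the augmented copies are only created afterwards for the training part. It assumes each original image has an integer ID and gets exactly one augmented copy; `split_by_source` and `with_augmented` are hypothetical helper names I made up for illustration, not functions from any library, and leaving the test set unaugmented is just one possible choice.

```python
import random

def split_by_source(image_ids, test_fraction=0.2, seed=0):
    """Shuffle the ORIGINAL image IDs and split them, so every source
    image lands in exactly one of the two sets."""
    ids = sorted(image_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]  # (train_ids, test_ids)

def with_augmented(ids):
    """Hypothetical stand-in for augmentation: pair each source ID with
    an 'original' and an 'augmented' variant tag."""
    return [(i, variant) for i in ids for variant in ("original", "augmented")]

# Assumption: 25,000 originals, identified by the integers 0..24999.
original_ids = list(range(25000))

train_ids, test_ids = split_by_source(original_ids)

# Augment AFTER splitting, and only on the training side.
train_samples = with_augmented(train_ids)
test_samples = [(i, "original") for i in test_ids]

# No source image contributes samples to both sets.
train_sources = {i for i, _ in train_samples}
test_sources = {i for i, _ in test_samples}
assert train_sources.isdisjoint(test_sources)
```

The alternative I am worried about would be to augment first and then shuffle all 50,000 samples together, in which case an original and its augmented copy could end up on opposite sides of the split.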
Am I correct to assume that augmented images should not leak to the test set?