## Would this relatively small dataset be enough to train a CNN?


Scenario: I am building a dataset of images for several animal classes, and I plan to train a CNN on those images for classification.

Problem: Assume I am unable to collect many images and could only gather a few for each class. Here are the counts:

| id | animal       | images |
|----|--------------|--------|
| 1  | Baboon       | 800   |
| 2  | Fox          | 1000  |
| 3  | Hyena        | 5000  |
| 4  | Giraffe      | 43    |
| 5  | Zebra        | 88    |
| 6  | Hippopotamus | 233   |
| 7  | Yak          | 578   |
| 8  | Polar Bear   | 456   |
| 9  | Lion         | 3442  |
| 10 | Indian Tiger | 40000 |


I have three questions.

1. Is this dataset good enough to train a CNN? I am worried about the number of images each class has.

2. Would augmenting the data help? I am planning to augment it.

3. The dataset above will grow in the future, so I will likely retrain the model. Should I design a model that fits the data at its present size, or build a larger one to accommodate future data?

I could get more data from the Internet, but this question is about the approaches to take when only a small amount of data is available, as in the National Data Science Bowl (classifying plankton).
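For what it's worth, the skew in the table above can be quantified directly. A quick sketch (class counts copied from the table):

```python
# Class counts taken from the table in the question.
counts = {
    "Baboon": 800, "Fox": 1000, "Hyena": 5000, "Giraffe": 43,
    "Zebra": 88, "Hippopotamus": 233, "Yak": 578,
    "Polar Bear": 456, "Lion": 3442, "Indian Tiger": 40000,
}

largest = max(counts, key=counts.get)   # most frequent class
smallest = min(counts, key=counts.get)  # least frequent class

# Ratio between the most and least frequent class.
imbalance_ratio = counts[largest] / counts[smallest]
print(largest, smallest, round(imbalance_ratio))  # 40000 vs. 43, roughly 930:1
```

A 930:1 ratio between the largest and smallest class is severe by most standards, which is what the answers below address.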

---

You can build a classifier with as few as roughly 100 images per class; in your case, the Zebra and Giraffe classes need more images. You can do it with DNNClassifier (TensorFlow), but the more images you have, the more accurate your classifier will be.

I also suggest watching the video *Train an Image Classifier with TensorFlow for Poets - Machine Learning Recipes #6*.
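The retraining approach that video describes can be sketched roughly as follows. This is only an illustration, not the video's exact code: MobileNetV2, the 224×224 input size, and the 10-class head are my choices, and `weights=None` is used here just to keep the sketch self-contained (in practice you would use `weights="imagenet"` so the frozen base actually carries pretrained features):

```python
import tensorflow as tf

# Pretrained convolutional base, used as a frozen feature extractor.
# Replace weights=None with weights="imagenet" for real transfer learning.
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False,
                                         weights=None,
                                         pooling="avg")
base.trainable = False  # only the new head below will be trained

# New classification head for the 10 animal classes in the question.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Because only the small head is trained, this style of retraining can get by with far fewer images per class than training a CNN from scratch.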

---

Your data set is what is called "unbalanced", and this can lead to problems in developing an accurate classifier.

The best thing to do (which you might not be able to do) is to find more images for the classes with fewer samples.

Another alternative is to synthetically generate more images. One way to do that is with Keras's `ImageDataGenerator.flow_from_directory`; documentation is at https://keras.io/preprocessing/image/. Create a directory (`your_dir`) and, inside it, a subdirectory `Giraffe`; place all 43 giraffe images into that subdirectory. Create another directory, `your_save_dir`, and leave it empty. Now create the generator shown below.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# your_dir and your_save_dir are the paths to the directories
# created as described above.
datagen = ImageDataGenerator(rotation_range=30,
                             width_shift_range=0.2,
                             height_shift_range=0.2,
                             shear_range=0.2,
                             zoom_range=0.2,
                             horizontal_flip=True,
                             fill_mode='nearest')

data = datagen.flow_from_directory(your_dir,
                                   target_size=(200, 200),
                                   batch_size=43,
                                   shuffle=False,
                                   save_to_dir=your_save_dir,
                                   save_format='png',
                                   interpolation='nearest')

images, labels = next(data)
```


Now, each time you execute the last line of code, you will generate and save 43 more images in `your_save_dir`, randomly transformed according to the parameters of the image data generator. While NOT as good as having truly original images, this helps significantly to balance the data set.

Do the same, of course, for the other classes that have few samples.
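To estimate how many passes of the generator each small class needs, compare its count to a target size. A sketch using the counts from the question (the 1,000-image target is my assumption, not something fixed):

```python
import math

# Image counts for the under-represented classes (from the question).
counts = {"Giraffe": 43, "Zebra": 88, "Hippopotamus": 233,
          "Polar Bear": 456, "Yak": 578, "Baboon": 800}

target = 1000  # assumed target number of images per class

# Each full pass of the generator over a class folder yields one
# augmented copy of every original image, so the passes needed are
# ceil(missing_images / originals).
passes = {name: math.ceil((target - n) / n) for name, n in counts.items()}
print(passes)  # Giraffe needs 23 passes, Zebra 11, Yak just 1
```

This makes it obvious where the augmentation effort goes: Giraffe and Zebra dominate, while classes near the target need only a pass or two.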

Another thing that can help: for the classes with fewer images, first crop each image so that the animal occupies as large a fraction of the pixels as possible, then run the augmentation process described above. This gives the network a higher proportion of meaningful pixels to "learn" from.

---

It is somewhat risky to discuss data independently of your learning mechanism. There is actually no such thing as good data or a good learner; there is only data that is good WITH a particular learner. That is even true of human intelligence, despite all the standardized education and testing done today.

There are also exceptional learners that find data to be good when most others fumble with it.

If by good data and deep learning you mean image sets that will lead to proper categorization of unseen images presented in production, your intuitive understanding of statistics can provide a general answer: the images on which the deep learner develops its activation weights and meta-parameters must be representative of the range of images that will appear in the production feeds.

If you intended to do a study of men and women to test the old belief that women are more motivated by the prospect of love and men by the prospect of sex, you wouldn't pick 43 men and 40,000 women for the study. The study's value is limited by the lower of the two numbers.

You can train the network with the category frequencies you have, but some deep learners may capitalize fully on feature extraction for Indian Tigers and Hyenas while exhibiting an unacceptable level of mis-categorization for Zebras and Giraffes.

Returning to the concept above, the skew in category frequency can be accounted for by the deep learner. It is theoretically possible to create an exceptional learner or one that is well attuned to this kind of frequency skew. A simple approach is to develop a scheme that recognizes frequency skew and allocates additional computing resources to the training that focuses on the differentiation of similar animals with infrequent labeled training instances.
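One simple, concrete version of such a scheme is inverse-frequency class weighting: weight each class's contribution to the loss by how rare it is. The formula below is the "balanced" heuristic popularized by scikit-learn, and the resulting dictionary is in the shape Keras's `model.fit(class_weight=...)` accepts; the counts come from the question's table:

```python
# Class counts from the question, in a fixed label order
# (Baboon, Fox, Hyena, Giraffe, Zebra, Hippo, Yak, Polar Bear, Lion, Tiger).
counts = [800, 1000, 5000, 43, 88, 233, 578, 456, 3442, 40000]

total = sum(counts)
n_classes = len(counts)

# "Balanced" heuristic: weight_i = total / (n_classes * count_i),
# so each rare-class image contributes more to the loss.
class_weight = {i: total / (n_classes * c) for i, c in enumerate(counts)}

# Giraffe (index 3, 43 images) gets a far larger weight than
# Indian Tiger (index 9, 40,000 images).
print(class_weight[3], class_weight[9])
```

This is not the fully general, skew-recognizing learner described above, but it is a cheap first step that makes the optimizer pay attention to the infrequent categories.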

I don't recall who has done that, but I know it has been done.

There are several ways you can give extra attention to the infrequent categories manually in the code, but then it would be a less general solution and the resulting program would neither be an exceptional learner nor particularly reusable.

It is more cost-effective to hunt for a skew-resistant deep learning scheme and test its accuracy on the infrequent animals than to send a photographer to Africa. If you can find more images of the rarer animals without monumental effort, I would do that too.