Frankly, even 50 images will not be sufficient if you are going to build and train a CNN model. If you need more images for training, use data augmentation: the process of transforming an image by a small amount (a change in height or width, a rotation, etc., or any combination of these). Each augmented image then differs slightly from its original. You can find a relevant article here-
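As a rough illustration of what "transforming an image by a small amount" can look like, here is a minimal NumPy sketch. The function name `augment`, the `max_shift` parameter, and the random stand-in image are all my own assumptions for the example; in practice you would more likely use the augmentation utilities of your deep learning framework.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a slightly altered copy of `image` (height x width x channels)."""
    out = image.copy()
    # Flip left-right with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Shift by up to `max_shift` pixels along height and width.
    # np.roll wraps pixels around the border; real pipelines usually
    # pad or crop instead, this is only to keep the sketch short.
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
# Random pixels standing in for a real 32x32 RGB photo (hypothetical data).
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = [augment(image, rng) for _ in range(5)]
```

Each element of `augmented` has the same shape and label as the original, so one source image yields several slightly different training samples.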
As for whether each class should have the same number of images: the counts should be approximately equal. Class imbalance is a general problem in classification tasks, and there are several ways to deal with it, including simulating additional data (augmentation).
I would suggest first setting aside a separate test set, then applying data augmentation to the remaining training set only (so that the test set contains only original images), and finally building the model.
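The order of steps suggested above can be sketched as follows, using only the standard library. Everything here is hypothetical: the file names, the 50-images-per-class counts, the 20% test fraction, and the helper names `stratified_split` and `augment_train`. The augmentation step is a placeholder that only tags file names; a real one would transform pixels.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (item, label) pairs so every class contributes the same share to the test set."""
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_fraction))
        test += [(x, label) for x in items[:n_test]]
        train += [(x, label) for x in items[n_test:]]
    return train, test

def augment_train(train, copies=3):
    """Placeholder: adds `copies` tagged variants per training sample, keeping the originals."""
    return train + [(f"{x}#aug{i}", y) for x, y in train for i in range(copies)]

# Hypothetical dataset: 50 images for each of two classes.
samples = [(f"cat_{i}.jpg", "cat") for i in range(50)]
samples += [(f"dog_{i}.jpg", "dog") for i in range(50)]

train, test = stratified_split(samples)   # 1. hold out the test set first
train_augmented = augment_train(train)    # 2. augment the training set only
# 3. fit the model on train_augmented and evaluate it on test
```

Splitting before augmenting matters: if you augment first, near-copies of the same image can land on both sides of the split and make the test score look better than it is.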
Using a pretrained convnet is also an option, as stated in a deep learning book-
A common and highly effective approach to deep learning on small image datasets is to use a pretrained network. A pretrained network is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task. If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer vision problems, even though these new problems may involve completely different classes than those of the original task. For instance, you might train a network on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for something as remote as identifying furniture items in images.
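A minimal sketch of that idea with Keras might look like the code below. The choice of VGG16, the 150x150 input size, the 256-unit dense layer, and the function name `build_transfer_model` are my assumptions, not something the quote prescribes. It needs TensorFlow installed and downloads the ImageNet weights on the first call.

```python
def build_transfer_model(num_classes, input_shape=(150, 150, 3)):
    """Frozen ImageNet-pretrained conv base with a small new classifier on top."""
    # Imported inside the function so the definition loads even where
    # TensorFlow is not installed.
    import tensorflow as tf

    # Convolutional base pretrained on ImageNet, without its original classifier.
    conv_base = tf.keras.applications.VGG16(
        weights="imagenet", include_top=False, input_shape=input_shape
    )
    conv_base.trainable = False  # keep the pretrained features fixed

    model = tf.keras.Sequential([
        conv_base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # expects integer class labels
        metrics=["accuracy"],
    )
    return model
```

Calling `build_transfer_model(num_classes=2)` returns a model in which only the two new dense layers are trained, which is usually far more forgiving on a small dataset than training every layer from scratch.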