I have an application where I want to find the locations of objects on a simple, relatively constant background (fixed camera angle, etc). For investigative purposes I've created a test dataset which displays many characteristics of the actual problem.
Our problem description is to find the bounding box of the single circle in the image. If there is more than one circle or no circles, we don't care about the bounding box (but we at least need to know that there is no valid single bounding box).
For my attempt to solve this, I built a CNN that would regress (min_x, min_y, max_x, max_y), plus one more value indicating how many circles were in the image.
I played with different architecture variations, but in general it was a very standard CNN: 3-4 ReLU conv layers with max pooling in between, followed by a dense layer and an output layer with linear activation for the bounding box outputs, trained to minimise the mean squared error between the outputs and the ground-truth bounding boxes.
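To make the setup concrete, here is a rough sketch of the kind of network I mean (PyTorch, for illustration only; the input size, channel counts and layer widths here are placeholders, not my exact hyperparameters):

```python
import torch
import torch.nn as nn

class BoxRegressor(nn.Module):
    """Standard CNN regressing a bounding box plus a circle-count indicator."""
    def __init__(self):
        super().__init__()
        # 3 ReLU conv blocks with max pooling, as described above
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            # 5 linear outputs: (min_x, min_y, max_x, max_y, circle count)
            nn.Linear(128, 5),
        )

    def forward(self, x):
        return self.head(self.features(x))

model = BoxRegressor()
x = torch.zeros(4, 1, 64, 64)          # batch of 64x64 grayscale images (assumed size)
pred = model(x)                        # shape (4, 5)
target = torch.zeros(4, 5)             # ground-truth boxes + counts
loss = nn.MSELoss()(pred, target)      # MSE over all five outputs
```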
Regardless of the architecture, hyperparameters, optimizers, etc., the result was always the same: the CNN could not come close to regressing an accurate bounding box, even with over 50,000 training examples to work with.
What gives? Do I need to look at another type of network, since CNNs are more suited to classification than to localisation tasks?
Obviously there are classical computer vision techniques that could solve this easily, but since the actual application is more involved, I want to know strictly about NN/AI approaches to this problem.