Recent advances in deep learning and dedicated hardware have made it possible to classify images with far better accuracy than ever before. Neural networks are the gold standard for computer vision applications and are widely used in industry, for example in internet search engines and autonomous cars. In real-life problems, an image contains regions with different objects, so it is not enough to classify the whole picture; the individual elements of the picture must be identified as well.
A while ago an alternative to the well-known sliding-window algorithm, called Region Proposal Networks, was described in the literature. It is essentially a convolutional neural network extended with a region vector.
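To make the idea concrete, a Region Proposal Network slides over a convolutional feature map and emits candidate boxes (anchors) at every cell, which are then scored and refined. Below is a minimal, illustrative sketch of the anchor-generation part only; the stride, sizes, and ratios are hypothetical values, not the configuration from any particular paper.

```python
import numpy as np

def make_anchors(fmap_h, fmap_w, stride=16, sizes=(32, 64), ratios=(1.0,)):
    """Generate candidate boxes (anchors) at every feature-map cell,
    as a Region Proposal Network does before scoring them.
    Boxes are returned as (x, y, w, h) in input-image coordinates."""
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            # Centre of this feature-map cell, mapped back to the image.
            cx, cy = j * stride + stride / 2, i * stride + stride / 2
            for s in sizes:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, w, h))
    return np.array(anchors)

anchors = make_anchors(4, 4)
print(anchors.shape)  # (32, 4): 4x4 cells x 2 sizes x 1 ratio
```

In a real RPN these anchors are scored by a small convolutional head; here only the box layout is shown.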
Problem that I am trying to solve:
In a given video frame, I want to pick some regions of interest (literally) and perform classification on those regions.
How it is currently implemented:
- Capture the video frame
- Split the video frame into multiple images each representing a region of interest
- Perform image classification (inference) on each image (each corresponding to a part of the frame)
- Aggregate the per-region results of step 3
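The steps above can be sketched roughly as follows; `classify` is a hypothetical stand-in for the real model's forward pass, and the frame and ROI coordinates are made up for illustration.

```python
import numpy as np

def classify(region):
    # Stand-in for a real classifier (e.g. a CNN forward pass);
    # here it just returns the index of the brightest colour channel.
    return int(np.argmax(region.mean(axis=(0, 1))))

def classify_frame(frame, rois):
    """Run one inference per region of interest (the current approach)."""
    results = []
    for (x, y, w, h) in rois:
        crop = frame[y:y + h, x:x + w]   # step 2: split frame into regions
        results.append(classify(crop))   # step 3: per-region inference
    return results                       # step 4: aggregate

# Hypothetical 64x64 RGB frame (step 1) with two regions of interest.
frame = np.zeros((64, 64, 3), dtype=np.float32)
frame[:32, :32, 0] = 1.0   # red patch in the top-left region
frame[32:, 32:, 2] = 1.0   # blue patch in the bottom-right region
rois = [(0, 0, 32, 32), (32, 32, 32, 32)]
print(classify_frame(frame, rois))  # one classify() call per ROI -> [0, 2]
```

Note that `classify` is invoked once per ROI, which is exactly the cost the next section complains about.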
Problem with the current approach
Multiple inference passes are required per frame, one for each region of interest.
I am looking for a solution where I specify the locations of interest in a frame, and the inference task, be it object detection or image classification, is performed only on those regions. Can you please point me to the references I need to study or use to do this?
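This is essentially what region-based detectors (Fast/Faster R-CNN) do: the backbone network runs once per frame, and each user-specified ROI is then pooled from the shared feature map instead of being run through the whole network separately (in PyTorch, `torchvision.ops.roi_align` implements this pooling step). A minimal NumPy sketch of the idea, with a toy average-pooling "backbone" and nearest-neighbour sampling standing in for the real network and real RoI pooling:

```python
import numpy as np

def backbone(frame, stride=4):
    # Toy stand-in for a CNN backbone: 4x spatial downsampling
    # by average pooling. Runs ONCE per frame.
    h, w, c = frame.shape
    return frame[:h - h % stride, :w - w % stride].reshape(
        h // stride, stride, w // stride, stride, c).mean(axis=(1, 3))

def roi_pool(feature_map, roi, stride=4, out_size=2):
    # Crop the ROI from the shared feature map and resample it to a
    # fixed grid by nearest-neighbour sampling (a crude stand-in for
    # RoI pooling / RoI align).
    x, y, w, h = (v // stride for v in roi)
    crop = feature_map[y:y + h, x:x + w]
    ys = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    return crop[np.ix_(ys, xs)]

frame = np.random.rand(64, 64, 3).astype(np.float32)
features = backbone(frame)                  # one forward pass per frame
rois = [(0, 0, 32, 32), (32, 32, 32, 32)]   # user-specified regions
pooled = [roi_pool(features, r) for r in rois]
print([p.shape for p in pooled])            # fixed-size features per ROI
```

Each pooled feature then goes through a lightweight classification head, so the expensive computation is shared across all regions rather than repeated per region.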