When using neural networks to detect features in an image, how can we locate that specific feature in the original image?



I understand how a neural network can be trained to recognise certain features in an image (faces, cars, ...), where the inputs are the image's pixels, and the output is a set of boolean values indicating which objects were recognised in the image and which weren't.

What I don't really get is how, when using this approach and a face is detected for example, we can go back to the original image and determine the location or boundaries of the detected face. How is this achieved? Can it be derived from the recognition algorithm itself, or is a separate algorithm used to locate the face? The latter seems unlikely, since locating the face again requires recognising it in the image, which was the reason for using a neural network in the first place.


Posted 2016-09-29T14:21:57.723

Reputation: 761

The recent Stanford deep learning course covers exactly this problem of localization and detection; see, for example, Lecture 8 - Localization and Detection, and Lecture 13 - Segmentation, soft attention, spatial transformers

– user7427549 – 2017-01-17T01:00:36.013



This problem is called object detection.

If you have a training set of images with boxed objects, you can train a neural network to predict the box directly. For example, the output can have the same spatial dimensions as the input, and the network learns to assign each pixel the probability of belonging to a certain object.
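As a toy illustration of the direct-prediction idea (the function names here are made up for illustration, not from any particular library), the regression targets for a box-predicting head are typically just the corner coordinates normalised to [0, 1]:

```python
def encode_box(box, img_w, img_h):
    """Normalise pixel-space corners (x0, y0, x1, y1) to [0, 1] so a
    network's 4-unit regression head can predict them directly."""
    x0, y0, x1, y1 = box
    return (x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h)

def decode_box(pred, img_w, img_h):
    """Map the 4 normalised network outputs back to pixel coordinates."""
    x0, y0, x1, y1 = pred
    return (x0 * img_w, y0 * img_h, x1 * img_w, y1 * img_h)

# e.g. a face at (64, 32)-(192, 160) in a 256x256 image
# becomes the target vector (0.25, 0.125, 0.75, 0.625)
```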

If you don't have such a convenient dataset, you could recursively narrow down the location by feeding parts of the image to the network until you find the smallest part that still fully activates a certain classification.

In this paper they try a mixture of these two approaches.
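The recursive-narrowing idea can be sketched as follows; the `classifier` argument is a stand-in for the trained network's score on a crop, and the quadrant-splitting strategy and thresholds are illustrative, not from the paper:

```python
import numpy as np

def localize(image, classifier, threshold=0.9, min_size=4):
    """Greedy coarse-to-fine search: repeatedly split the current window
    into four half-size quadrants and descend into the one the classifier
    still scores above `threshold`.  Returns (row0, col0, row1, col1) of
    the smallest window that still activates the classifier."""
    r0, c0 = 0, 0
    r1, c1 = image.shape
    while r1 - r0 > min_size and c1 - c0 > min_size:
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        quadrants = [(r0, c0, rm, cm), (r0, cm, rm, c1),
                     (rm, c0, r1, cm), (rm, cm, r1, c1)]
        scores = [classifier(image[q[0]:q[2], q[1]:q[3]]) for q in quadrants]
        best = int(np.argmax(scores))
        if scores[best] < threshold:
            break  # no smaller window still contains the whole object
        r0, c0, r1, c1 = quadrants[best]
    return r0, c0, r1, c1
```

In practice you would slide overlapping windows at several scales rather than split disjoint quadrants, but the principle - reuse the classifier itself to shrink the region - is the same.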


Posted 2016-09-29T14:21:57.723

Reputation: 3 667


The approach you listed here is not really an approach; it is a very vague idea of how someone could achieve the task. You have basically said we have an algorithm f(image) = result, and there is an infinite number of concrete approaches that could implement it.

In the majority of CNN approaches, the image travels through a stack of convolution/pooling layers, each of which reduces the spatial dimensions. You end up with a significantly smaller layer that goes through a softmax to produce probabilities for the different classes. This type of network does not tell you where something was found; it only tells you that something was found somewhere in the original image.
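A minimal numpy sketch of that shape reduction (illustrative only - a real CNN interleaves learned convolutions with the pooling): repeated 2x2 pooling shrinks the spatial map, and the final flatten + softmax discards whatever location information is left:

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling: halves each spatial dimension."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
fmap = rng.standard_normal((32, 32))      # a feature map from some conv layer
for _ in range(3):
    fmap = max_pool2(fmap)                # 32x32 -> 16x16 -> 8x8 -> 4x4
w = rng.standard_normal((fmap.size, 10))  # fully connected weights, 10 classes
probs = softmax(fmap.ravel() @ w)         # class probabilities only -
                                          # all spatial structure is gone
```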

Salvador Dali

Posted 2016-09-29T14:21:57.723

Reputation: 406


To add to @BlindKungFuMaster's answer, the object detection problem in the NN context isn't necessarily solved by a network whose output has the same size as the input. Actually that approach - assigning each pixel an object probability - better fits the Instance/Semantic Segmentation problem.

To briefly review the common state-of-the-art object detectors, there are mainly two families:

  1. The two-stage approach: one stage finds candidate bounding boxes (a variant of sliding window), another stage computes the object probabilities, and some layers may be shared between the two. The line runs from RCNN (propose boxes with the Selective Search algorithm, check whether each contains an object by running a CNN - AlexNet in this case - on every box, and if it does, run a linear regression to tighten the box), through Fast-RCNN (share the computations across all boxes and use ROI pooling to select the region corresponding to each box), Faster-RCNN (speed up the region-proposal stage by replacing the heavy Selective Search with a fully convolutional Region Proposal Network), R-FCN (share all computations for both object scores and box proposals), up to Mask-RCNN, which extends the Faster-RCNN solution to semantic segmentation and also improves detection by replacing ROI pooling with ROI Align, using bilinear interpolation to keep the box coordinates better aligned with the object features.
  2. The single-stage approach, such as SSD, DSSD, YOLO, YOLO2 and lastly RetinaNet with the focal loss, which in a single shot produces both boxes and object scores, and works much faster than the two-stage approaches.
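All of these detectors ultimately score predicted boxes against the ground truth (and prune duplicate detections in non-maximum suppression) with the same Intersection-over-Union measure; a minimal sketch:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```

A predicted box is typically counted as a correct detection when its IoU with a ground-truth box exceeds some threshold, commonly 0.5.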

A great summary of object recognition, detection and segmentation is published here - highly recommended.


Posted 2016-09-29T14:21:57.723

Reputation: 249