I asked this question on /r/learnmachinelearning, but there was no answer so I'm reposting it here.

I was reading an article on detecting rectangles in an image, here. My question is about the part where the model works fine when detecting a single object, but struggles when detecting two rectangles.

The author reasons this as follows:

> We train our network on the leftmost image in the plot above. Let’s say that the expected bounding box of the left rectangle is at position 1 in the target vector (x1, y1, w1, h1), and the expected bounding box of the right rectangle is at position 2 in the vector (x2, y2, w2, h2). Apparently, our optimizer will change the parameters of the network so that the first predictor moves to the left, and the second predictor moves to the right. Imagine now that a bit later we come across a similar image, but this time the positions in the target vector are swapped (i.e. left rectangle at position 2, right rectangle at position 1). Now, our optimizer will pull predictor 1 to the right and predictor 2 to the left — exactly the opposite of the previous update step! In effect, the predicted bounding boxes stay in the center.
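For what it's worth, the averaging effect the author describes can be sketched numerically. This is my own toy setup (plain scalar gradient descent on the x-coordinates only, with made-up coordinates and learning rate), not the article's actual code:

```python
import numpy as np

# Two scalar "predictors" standing in for the x-coordinates of the two
# predicted boxes; for simplicity they don't depend on the input image.
p = np.array([0.5, 0.5])      # x-coordinates predicted by predictor 1 and 2
left, right = 0.2, 0.8        # true x-coordinates of the two rectangles
lr = 0.1

for step in range(1000):
    # Half the examples list (left, right) in the target vector, half list
    # (right, left): the arbitrary ordering the author blames.
    if step % 2 == 0:
        target = np.array([left, right])
    else:
        target = np.array([right, left])
    p -= lr * 2 * (p - target)  # gradient step on squared error

print(p)  # both predictors oscillate around 0.5, the midpoint of 0.2 and 0.8
```

Each pair of opposite updates nearly cancels, so both predictors settle near the mean of the two true positions rather than on either box.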

I don't understand how this reasoning is correct, apart from the fact that when they flip the rectangles to mitigate this error, accuracy actually improves (so there is experimental evidence, but not much theoretical justification).

My reasoning is that in the single-rectangle case the network also has to learn boxes for objects placed at all sorts of positions, just as in the two-rectangle case, so there too it should end up predicting boxes near the center. I concede I'm a novice at this, so I would love to find out where my reasoning goes wrong, because experimentally I am wrong (i.e. accuracy does improve when the rectangles are flipped).
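The flip experiment can be mirrored in a toy scalar setup (my own simplification, not the article's network): when the targets are always listed in a consistent order, e.g. leftmost box first, the same gradient-descent updates converge to the true coordinates instead of the midpoint.

```python
import numpy as np

# Toy scalar gradient descent on the x-coordinates of two predicted boxes
# (made-up coordinates and learning rate, for illustration only).
p = np.array([0.5, 0.5])      # x-coordinates predicted by predictor 1 and 2
left, right = 0.2, 0.8        # true x-coordinates of the two rectangles
lr = 0.1

for step in range(1000):
    # Consistent ordering: the leftmost rectangle is always at position 1,
    # so predictor 1 is never pulled in two opposite directions.
    target = np.array([left, right])
    p -= lr * 2 * (p - target)  # gradient step on squared error

print(p)  # converges to [0.2, 0.8], the true positions
```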

Thoughts? Also, if this is not the right sub/forum for this type of question, please feel free to point me toward one that better suits the content.