
Leaving out the class-balancing weighting factor, we can define the focal loss as

$$FL(p) = -(1-p)^\gamma \log(p) $$

where $p$ is the model's predicted probability of the target class. The idea is that single-stage object detectors suffer a huge class imbalance between foreground and background (several orders of magnitude of difference), and this loss down-weights examples that are already well classified relative to the ordinary cross-entropy ($CE(p) = -\log(p)$), so that the optimization can focus on the hard ones.
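To make the down-weighting concrete, here is a quick numerical illustration (the function names `cross_entropy` and `focal_loss` are just my own, not from any library):

```python
import math

def cross_entropy(p):
    """Ordinary cross-entropy for the target-class probability p."""
    return -math.log(p)

def focal_loss(p, gamma=2.0):
    """Focal loss FL(p) = -(1-p)^gamma * log(p)."""
    return -((1.0 - p) ** gamma) * math.log(p)

# A well-classified example (p close to 1) is down-weighted heavily,
# while a hard example (small p) keeps nearly its full cross-entropy.
for p in (0.1, 0.5, 0.9, 0.99):
    print(f"p={p}: CE={cross_entropy(p):.4f}  FL={focal_loss(p):.4f}")
```

With $\gamma = 2$ the loss at $p = 0.99$ is scaled by $(1-0.99)^2 = 10^{-4}$, which is exactly the effect that lets the easy background examples fade out of the total loss.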

On the other hand, the general optimization scheme uses the gradient to step in the direction of steepest descent. There exist variants with adaptive learning rates, momentum, and so on, but that is the general gist:

$$ \theta \leftarrow \theta - \eta \nabla_\theta L $$

The gradient of the focal loss follows as

$$\dot {FL}(p) = \dot p \left[\gamma(1-p)^{\gamma -1} \log(p) -\frac{(1-p)^\gamma}{p}\right]$$

compared to the gradient of the ordinary cross-entropy,

$$ \dot{CE}(p) = -\frac{\dot p}{p}$$

So we can rewrite this as

$$\dot{FL}(p) = (1-p)^\gamma \dot{CE}(p) + \gamma \dot p (1-p)^{\gamma -1} \log(p)$$
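A finite-difference check confirms this decomposition (with $\dot p = 1$, i.e. differentiating with respect to $p$ itself; `focal_grad` is my own name for the closed-form expression above):

```python
import math

def focal_loss(p, gamma=2.0):
    return -((1.0 - p) ** gamma) * math.log(p)

def focal_grad(p, gamma=2.0):
    # (1-p)^gamma * CE'(p) + gamma * (1-p)^(gamma-1) * log(p), with p_dot = 1
    ce_grad = -1.0 / p
    return (1.0 - p) ** gamma * ce_grad \
        + gamma * (1.0 - p) ** (gamma - 1) * math.log(p)

# Central finite difference agrees with the closed form.
p, h = 0.7, 1e-6
numeric = (focal_loss(p + h) - focal_loss(p - h)) / (2 * h)
print(abs(numeric - focal_grad(p)))  # on the order of the discretization error
```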

The first term, given our optimization scheme, does what we (and the authors of the RetinaNet paper) want: it down-weights the updates from examples that are already well classified. The second term, however, is harder to interpret in parameter space and may cause unwanted behaviour. So my question is: why not remove it and use only the gradient

$$\dot L = (1-p)^\gamma \dot{CE}(p)$$

which, for $\gamma \in \mathbb{N}$, corresponds to the loss function

$$ L(p) = -log(p) - \sum_{i=1}^\gamma {\gamma \choose i}\frac{(-p)^i}{i}$$
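To check that this series really has the intended gradient, one can verify numerically that its derivative equals $-(1-p)^\gamma / p$, i.e. $(1-p)^\gamma \dot{CE}(p)$ (again with $\dot p = 1$; `series_loss` is just my own name for the expansion above):

```python
import math
from math import comb

def series_loss(p, gamma=3):
    # L(p) = -log(p) - sum_{i=1}^gamma C(gamma, i) * (-p)^i / i
    return -math.log(p) - sum(
        comb(gamma, i) * (-p) ** i / i for i in range(1, gamma + 1)
    )

# Its derivative should be -(1-p)^gamma / p.
p, h, gamma = 0.6, 1e-6, 3
numeric = (series_loss(p + h, gamma) - series_loss(p - h, gamma)) / (2 * h)
analytic = -((1.0 - p) ** gamma) / p
print(abs(numeric - analytic))  # on the order of the discretization error
```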

**Summary:** Is there a reason we make the loss adaptive and not the gradient in cases like focal loss? Does that second term add something useful?