## Why is Jensen-Shannon divergence preferred over Kullback-Leibler divergence in measuring the performance of a generative network?

I have read articles saying that Jensen-Shannon divergence is preferred over Kullback-Leibler divergence for measuring how well a generative network has learned a distribution mapping, because JS divergence better handles the case where one distribution assigns zero probability to points where the other does not.

I don't understand how the mathematical formulation of JS divergence handles this, nor what qualitative advantage it holds beyond this edge case. Could anyone explain, or link to an explanation that answers this satisfactorily?

\begin{align} D_{JS}(p||q) &= \frac{1}{2}\left[D_{KL}\left(p\,\Big|\Big|\,\frac{p+q}{2}\right) + D_{KL}\left(q\,\Big|\Big|\,\frac{p+q}{2}\right)\right] \\ &= \frac{1}{2}\sum_{x\in\Omega} \left[p(x)\log\left(\frac{2\,p(x)}{p(x)+q(x)}\right) + q(x)\log\left(\frac{2\,q(x)}{p(x)+q(x)}\right)\right] \end{align}
where $$\Omega$$ is the union of the supports of $$p$$ and $$q$$. Now let's assume one distribution is zero at a point where the other is not; without loss of generality (by symmetry) say $$p(x_i) = 0$$ and $$q(x_i) \neq 0$$. The $$p(x_i)$$ term vanishes (using the convention $$0\log 0 = 0$$), so the term for $$x_i$$ in the sum becomes
$$\frac{1}{2}\,q(x_i)\log\left(\frac{2\,q(x_i)}{q(x_i)}\right) = q(x_i)\frac{\log 2}{2},$$
which is finite. By contrast, the corresponding term in $$D_{KL}(q||p)$$ would be $$q(x_i)\log\left(\frac{q(x_i)}{0}\right) = \infty$$, so KL divergence blows up whenever the supports don't match, while JS divergence stays bounded.
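To make this concrete, here is a minimal numeric sketch (the function names `kl` and `js` are my own, not from any particular library) comparing the two divergences on distributions with mismatched supports:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) over a discrete domain; infinite if q = 0 where p > 0."""
    mask = p > 0  # terms with p(x) = 0 contribute 0 (convention 0*log 0 = 0)
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """JS(p || q) = average KL of p and q against the mixture (p+q)/2."""
    m = 0.5 * (p + q)  # the mixture is nonzero wherever p or q is nonzero
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# p is zero on the third outcome, q is zero on the first.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])

print(kl(p, q))  # inf: p puts mass where q is zero
print(js(p, q))  # finite, and bounded above by log(2)
```

The key point is visible in `js`: each KL term is computed against the mixture $m = (p+q)/2$, which is strictly positive wherever either distribution is, so neither logarithm can hit a zero denominator.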