I have read articles claiming that Jensen-Shannon divergence is preferred over Kullback-Leibler divergence for measuring how well a generative network has learned a distribution mapping, because JS divergence better measures distribution similarity when either distribution contains zero values.
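To make the zero-value edge case concrete, here is a minimal numeric sketch (my own illustration, not from any of the articles): when `Q` assigns zero probability to an outcome that `P` supports, `KL(P‖Q)` diverges to infinity, while JS divergence stays finite because it compares each distribution against the mixture `M = (P + Q)/2`, which is nonzero wherever either distribution is.

```python
import numpy as np

def kl(p, q):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i), summing only where p_i > 0.
    # Blows up to infinity when q_i = 0 but p_i > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    # JS(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
    # M is strictly positive wherever P or Q has mass, so both KL terms are finite.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]  # q is zero on an outcome where p has mass

print(kl(p, q))  # inf: p[0] > 0 but q[0] = 0
print(js(p, q))  # finite, bounded above by log 2
```

The key mechanism is visible in `js`: neither argument is ever compared directly against the other, only against the mixture, so a zero in one distribution can never appear alone in a denominator.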

I am unable to understand how the mathematical formulation of JS divergence takes care of this, or what qualitative advantages it holds apart from this edge case. Could anyone explain, or link me to an explanation that answers this satisfactorily?