Isn't it a problem that $y_i$ (in $\log(y_i)$) could be 0?

Yes it is, since $\log(0)$ is undefined. In practice this problem is avoided by using $\log(y_i + \epsilon)$ for a small $\epsilon > 0$.
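A minimal sketch of the $\epsilon$ trick in Python (the value `1e-12` is an arbitrary illustrative choice, not a standard):

```python
import math

def safe_log(y, eps=1e-12):
    """Clamp the argument away from 0 so log never sees exactly 0.
    eps = 1e-12 is an arbitrary illustrative choice."""
    return math.log(y + eps)

# safe_log(0.0) returns a large negative number instead of raising
# a math domain error, so the loss stays finite.
print(safe_log(0.0))
```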

What is correct?

(a) $H_{y'} (y) := - \sum_{i} y_{i}' \log (y_i)$ or

(b) $H_{y'}(y) := - \sum_{i} ({y_i' \log(y_i) + (1-y_i') \log(1-y_i)})$?

(a) is correct for multi-class prediction (it is actually a double summation: over training points and over classes); (b) is the special case of (a) for two-class prediction. Both are cross-entropy.

### Example:

Suppose each training point $x_i$ has a true label $c_i' \in \{0, 1\}$, and the model predicts $c_i \in [0, 1]$.

For 5 data points, the true labels $c_i'$ and model predictions $c_i$ are:

$(c_i', c_i)=\{(0, 0.1), (0, 0.4), (0, 0.8), (1, 0.8), (1, 0.2)\}$ (1),

Define vectors $y_i'$ and $y_i$ as

$y_{ik}' := 1$ if $c_i' = k$, and $y_{ik}' := 0$ otherwise,

$y_{ik} := p(k|x_i)$ is the probability of $x_i$ belonging to class $k$, as estimated by the model.

Example (1) in $(y_i', y_i)$ notation turns into:

$(y_i', y_i)=\{([1, 0], [0.9, 0.1]),$ $([1, 0], [0.6, 0.4]),$ $([1, 0], [0.2, 0.8]),$ $([0, 1], [0.2, 0.8]),$ $([0, 1], [0.8, 0.2])\}$,

Both (a) and (b) are calculated as:

$H_{y'}(y)=-\frac{1}{5}\left([\log(0.9)+\log(0.6)+\log(0.2)]_{c_i'=0} + [\log(0.8)+\log(0.2)]_{c_i'=1}\right) = 0.352$ (using base-10 logarithms).
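The calculation above can be reproduced directly in Python; note that the 0.352 figure corresponds to base-10 logarithms:

```python
import math

# True labels c' and predicted P(class 1 | x) for the 5 points.
data = [(0, 0.1), (0, 0.4), (0, 0.8), (1, 0.8), (1, 0.2)]

# Sum the log-probability assigned to the *true* class:
# 1 - c for label 0, c for label 1.
total = 0.0
for c_true, c_pred in data:
    p_true = c_pred if c_true == 1 else 1 - c_pred
    total += math.log10(p_true)  # the 0.352 figure uses base-10 logs

H = -total / len(data)
print(round(H, 3))  # 0.352
```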

### Derivation:

Suppose there are multiple classes $1$ to $K$.

For a training point $(x_i, c_i')$, $c_i' = k$ is equivalent to $y_i'=[0,\ldots,1,0,\ldots]$, which is 1 in the $k^{th}$ position and 0 elsewhere. When $y_{ik}'=1$, we want the model's output $y_{ik}=p(k|x_i)$ to be close to 1. Therefore, the loss of $(x_i, k)$ can be defined as $-\log(y_{ik})$, which gives $y_{ik} \rightarrow 1 \Rightarrow -\log(y_{ik}) \rightarrow 0$. The loss over all classes can be combined as:

$L(y_i', y_i) = -\sum_{k=1}^{K}y_{ik}'\log(y_{ik})$.

When $y_{ik}' = 1$, the loss of every other class $k' \neq k$ is disabled, since $0 \cdot \log(y_{ik'})=0$; so, for example, when the true label is $y_{im}'=1$, the loss is:

$L(y_i', y_i)=-\log(y_{im})$.

The final formula over all training points is:

$H_{y'}(y)=-\sum_{(x_i, y_i')}\sum_{k=1}^{K}y_{ik}'\log(y_{ik})$.
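This formula can be sketched as a small function. Base-10 logs and averaging over points are chosen here to match the worked example above; both are conventions, not part of the definition:

```python
import math

def cross_entropy(y_true, y_pred, log=math.log10):
    """H_{y'}(y) = -sum_i sum_k y'_ik * log(y_ik), averaged over points.

    y_true: list of one-hot label vectors.
    y_pred: list of predicted probability vectors.
    """
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        # Only the true class contributes, since y'_ik = 0 elsewhere.
        total += sum(t * log(p) for t, p in zip(yt, yp) if t > 0)
    return -total / len(y_true)

y_true = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
y_pred = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.2, 0.8], [0.8, 0.2]]
print(round(cross_entropy(y_true, y_pred), 3))  # 0.352
```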

For binary classification, we have $y_{i0}' = 1 - y_{i1}'$ (true labels) and $y_{i0} = 1 - y_{i1}$ (model predictions), therefore (a) can be rewritten as:

$\begin{align*}
H_{y'}(y)&=-\sum_{(x_i, y_i')}\left(y_{i1}'\log(y_{i1})+y_{i0}'\log(y_{i0})\right)\\
&=-\sum_{(x_i, y_i')}\left(y_{i1}'\log(y_{i1})+(1-y_{i1}')\log(1-y_{i1})\right)
\end{align*}$

which is the same as (b).
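Form (b) can be checked numerically on the same five points; it needs only the class-1 probabilities (base-10 logs again chosen to match the example):

```python
import math

def ce_two_class(y1_true, y1_pred, log=math.log10):
    """Form (b): binary cross-entropy from class-1 labels and probabilities."""
    return -sum(t * log(p) + (1 - t) * log(1 - p)
                for t, p in zip(y1_true, y1_pred)) / len(y1_true)

y1_true = [0, 0, 0, 1, 1]
y1_pred = [0.1, 0.4, 0.8, 0.8, 0.2]

# Matches the 0.352 computed with the double-summation form (a).
print(round(ce_two_class(y1_true, y1_pred), 3))  # 0.352
```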

**Cross-entropy (a) over classes (one summation)**

Cross-entropy (a) over classes is:

$H_{y'}(y)=-\sum_{k=1}^{K}y_{k}'\log(y_{k})$,

This version cannot be used for the classification task. Let's reuse the data from the previous example:

$(c_i', c_i)=\{(0, 0.1), (0, 0.4), (0, 0.8), (1, 0.8), (1, 0.2)\}$

Empirical class probabilities are: $y'_0 = 3/5 = 0.6$, and $y'_1 = 0.4$,

Class probabilities estimated by the model (thresholding predictions at 0.5) are: $y_0 = 3/5 = 0.6$, and $y_1 = 0.4$

(a) is calculated as: $-y'_0\log(y_0) - y'_1\log(y_1) = -0.6\log(0.6) - 0.4\log(0.4) = 0.292$.

Two data points, $(0, 0.8)$ and $(1, 0.2)$, are misclassified, but $y'_0$ and $y'_1$ are estimated correctly!

If all 5 points were classified correctly, as in:

$(c_i', c_i)=\{(0, 0.1), (0, 0.4), (0, \color{blue}{0.2}), (1, 0.8), (1, \color{blue}{0.8})\}$ ,

(a) still remains the same, since $y_0$ is again estimated as $3/5$.
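A sketch confirming this: the single-summation version compares class frequencies only, so it cannot tell the two prediction sets apart. (Thresholding predictions at 0.5 to obtain the model's class frequencies is an assumption carried over from the example above.)

```python
import math

def marginal_ce(preds, log=math.log10):
    """Single-summation version: cross-entropy between empirical class
    frequencies and the frequencies implied by thresholding at 0.5."""
    n = len(preds)
    y1_emp = sum(c for c, _ in preds) / n          # empirical P(class 1)
    y1_hat = sum(p >= 0.5 for _, p in preds) / n   # predicted P(class 1)
    return -((1 - y1_emp) * log(1 - y1_hat) + y1_emp * log(y1_hat))

bad  = [(0, 0.1), (0, 0.4), (0, 0.8), (1, 0.8), (1, 0.2)]  # 2 misclassified
good = [(0, 0.1), (0, 0.4), (0, 0.2), (1, 0.8), (1, 0.8)]  # all correct

# Both give 0.292: the loss cannot distinguish the two cases.
print(round(marginal_ce(bad), 3), round(marginal_ce(good), 3))  # 0.292 0.292
```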


See also: Kullback-Leibler Divergence Explained blog post. – Piotr Migdal – 2017-05-11

See also: http://stats.stackexchange.com/questions/80967/qualitively-what-is-cross-entropy – Piotr Migdal – 2016-01-22