Let's start with understanding entropy in information theory: suppose you want to communicate the string "aaaaaaaa". You could easily do that as 8*"a". Now take another string, "jteikfqa". Is there a compressed way of communicating this string? There isn't, is there? We say the entropy of the second string is higher because we need more "bits" of information to communicate it.

This analogy applies to probabilities as well. If you have a set of items, fruits for example, a binary encoding of those fruits requires $\log_2(n)$ bits, where n is the number of fruits. For 8 fruits you need 3 bits, and so on. Another way of looking at this: given that the probability of selecting a fruit at random is 1/8, the uncertainty reduction when a fruit is selected is $-\log_{2}(1/8)$, which is 3. More specifically,

$$-\sum_{i=1}^{8}\frac{1}{8}\log_{2}(\frac{1}{8}) = 3$$
This entropy tells us about the uncertainty involved in a probability distribution: the more uncertainty/variation in the distribution, the larger the entropy (e.g. for 1024 fruits it would be 10 bits).
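As a quick sanity check, here is a minimal sketch of the entropy formula above (the function name `entropy` is just for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform choice among 8 fruits: 3 bits of uncertainty
print(entropy([1/8] * 8))        # 3.0
# Uniform choice among 1024 fruits: 10 bits
print(entropy([1/1024] * 1024))  # 10.0
```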

In "cross"-entropy, as the name suggests, we focus on the number of bits required to encode events from one probability distribution using a code optimized for another. The best-case scenario is that both distributions are identical, in which case the fewest bits are required, i.e. the plain entropy. In mathematical terms,

$$H(\mathbf{y},\hat{\mathbf{y}}) = -\sum_{i}y_i\log_{e}(\hat{y}_i)$$

where $\hat{\mathbf{y}}$ is the predicted probability vector (softmax output) and $\mathbf{y}$ is the ground-truth vector (e.g. one-hot). We use the natural log because it is easy to differentiate (ref. calculating gradients), and we do not take the log of the ground-truth vector because it contains many 0's, which simply zero out the corresponding terms of the summation.

Bottom line: in layman's terms, one could think of cross-entropy as the distance between two probability distributions in terms of the amount of information (bits) needed to explain that distance. It is a neat way of defining a loss that goes down as the probability vectors get closer to one another.
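The loss behaviour described above can be sketched directly from the formula (names like `cross_entropy` are illustrative, not from any particular library):

```python
import math

def cross_entropy(y, y_hat):
    """H(y, y_hat) = -sum_i y_i * ln(y_hat_i); natural log, as in the formula above."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)

y = [0, 1, 0]             # one-hot ground truth: class 1
good = [0.1, 0.8, 0.1]    # confident, correct prediction
bad = [0.4, 0.2, 0.4]     # prediction far from the ground truth

print(cross_entropy(y, good))  # ≈ 0.223, i.e. -ln(0.8)
print(cross_entropy(y, bad))   # ≈ 1.609, i.e. -ln(0.2)
```

As the predicted vector moves toward the one-hot ground truth, the loss shrinks toward 0.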

Okay. That means the loss would be the same no matter if the predictions are [0.1 0.5 0.1 0.1 0.2] or [0.1 0.6 0.1 0.1 0.1]? – enterML – 2017-07-10T14:48:09.923

@Nain: That is correct for your example. The cross-entropy loss does not depend on what the values of incorrect class probabilities are. – Neil Slater – 2017-07-10T15:25:22.980
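This is easy to verify numerically. Assuming the true class is one where both prediction vectors agree (here index 0, with probability 0.1 in both), only the predicted probability of the true class enters the loss:

```python
import math

y = [1, 0, 0, 0, 0]              # assumed one-hot ground truth: class 0
p1 = [0.1, 0.5, 0.1, 0.1, 0.2]
p2 = [0.1, 0.6, 0.1, 0.1, 0.1]   # same mass on class 0, redistributed elsewhere

loss = lambda y, p: -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

print(loss(y, p1))  # ≈ 2.3026, i.e. -ln(0.1)
print(loss(y, p2))  # identical: the incorrect-class probabilities never enter the sum
```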

@NeilSlater You may want to update your notation slightly. Right now, if `\cdot` is a dot product and `y` and `y_hat` have the same shape, then the shapes do not match. You may need to add a transpose symbol or redefine `\cdot` to mean inner product. – Lukas – 2019-12-11T12:48:57.550

@Lukas: Good spot, I just defined my use of cdot to mean inner product – Neil Slater – 2020-01-19T08:35:47.560

This answer is so helpful, thank you so much! Btw, what does the `H` symbol stand for? – SomethingSomething – 2020-03-01T11:55:24.963

@SomethingSomething: It is the name of an entropy-measuring function here. H is often used as a symbol for something measuring entropy: https://math.stackexchange.com/questions/84719/why-is-h-used-for-entropy – Neil Slater – 2020-03-01T12:41:09.313

And mostly `log` here means `ln` – bit_scientist – 2020-06-03T00:06:46.633