This question gives a quantitative definition of cross entropy,
in terms of its formula.
I’m looking for a more notional definition. The one I have found says:
In information theory, the cross entropy between two probability
distributions measures the average number of bits needed to identify
an event from a set of possibilities, if a coding scheme is used based
on a given probability distribution q, rather than the “true” distribution p.
I have emphasised the part that I am having trouble understanding.
I would like a nice definition that doesn’t require a separate (pre-existing) understanding of entropy.
To encode an event occurring with probability $p$ you need at least $\log_2(1/p)$ bits (why? see my answer on “What is the role of the logarithm in Shannon’s entropy?”).
So in optimal encoding the average length of an encoded message is

$$\sum_i p_i \log_2\left(\frac{1}{p_i}\right),$$

that is, the Shannon entropy of the original probability distribution.
However, if for probability distribution $P$ you use an encoding which is optimal for a different probability distribution $Q$, then the average length of the encoded message is

$$\sum_i p_i \log_2\left(\frac{1}{q_i}\right),$$

which is the cross entropy, and it is greater than $\sum_i p_i \log_2\left(\frac{1}{p_i}\right)$.
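The two sums above can be checked numerically. The following is only a minimal sketch: the function names `entropy` and `cross_entropy` and the distributions `P` and `Q` are made up for illustration, and it assumes $q_i > 0$ wherever $p_i > 0$.

```python
import math

def entropy(p):
    """Shannon entropy in bits: sum of p_i * log2(1 / p_i)."""
    # Events with p_i == 0 never occur, so they contribute nothing.
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average code length in bits when events follow p but the code is optimal for q."""
    # Assumption: q_i > 0 wherever p_i > 0; otherwise an event that does
    # occur has no finite codeword and this raises ZeroDivisionError.
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, chosen so that every code length is a whole number of bits.
P = (0.5, 0.25, 0.25)
Q = (0.25, 0.25, 0.5)

h_p = entropy(P)            # 1.5 bits per event with the code matched to P
h_pq = cross_entropy(P, Q)  # 1.75 bits per event with the code matched to Q
print(h_p, h_pq)
```

Using the code matched to `Q` costs a quarter of a bit per event more than the code matched to `P`, and `cross_entropy(P, P)` gives back `entropy(P)`.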
As an example, consider an alphabet of four letters (A, B, C, D), but with A and B having the same frequency and C and D not appearing at all. So the probability distribution is $P = (\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.
Then if we want to encode it optimally, we encode A as 0 and B as 1, so we get one bit of encoded message per letter. (This is exactly the Shannon entropy of our probability distribution.)
But if we have the same probability distribution $P$, but we encode it according to a distribution where all letters are equally probable, $Q = (\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})$, then we get two bits per letter (for example, we encode A as 00, B as 01, C as 10 and D as 11).
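The example can also be run directly by encoding a message with both codebooks and counting bits. This is a rough sketch: the codebooks are the ones described above, while the message string and the helper `bits_per_letter` are invented for illustration.

```python
# Codebook that is optimal for P = (1/2, 1/2, 0, 0): only A and B need codewords.
code_for_p = {"A": "0", "B": "1"}
# Codebook that is optimal for the uniform Q = (1/4, 1/4, 1/4, 1/4).
code_for_q = {"A": "00", "B": "01", "C": "10", "D": "11"}

def bits_per_letter(message, codebook):
    """Encode the message and return the average number of bits per letter."""
    encoded = "".join(codebook[letter] for letter in message)
    return len(encoded) / len(message)

# Hypothetical message drawn from P: A and B equally frequent, C and D absent.
message = "ABBABAAB"

bits_p = bits_per_letter(message, code_for_p)
bits_q = bits_per_letter(message, code_for_q)
print(bits_p, bits_q)  # 1.0 2.0
```

The first number matches the entropy of P (one bit) and the second matches the cross entropy of P relative to Q (two bits), since $\tfrac{1}{2}\log_2 4 + \tfrac{1}{2}\log_2 4 = 2$.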