This question gives a quantitative definition of cross entropy, in terms of its formula.

I'm looking for a more notional definition. Wikipedia says:

> In information theory, the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, *if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p*.

I have emphasised the part that is giving me trouble understanding this. I would like a nice definition that doesn't require a separate (pre-existing) understanding of entropy.

**Answer**

To encode an event occurring with probability p you need at least log2(1/p) bits (why? see my answer on “What is the role of the logarithm in Shannon’s entropy?”).

So with the optimal encoding, the average length of an encoded message is

$$\sum_i p_i \log_2\!\left(\frac{1}{p_i}\right),$$

that is, the Shannon entropy of the original probability distribution.

However, if for probability distribution $P$ you use an encoding that is optimal for a different probability distribution $Q$, then the average length of the encoded message is

$$\sum_i p_i\,\operatorname{code\_length}(i) = \sum_i p_i \log_2\!\left(\frac{1}{q_i}\right).$$

This is the cross entropy, which is greater than or equal to $\sum_i p_i \log_2\!\left(\frac{1}{p_i}\right)$.
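The two formulas above can be sketched in a few lines of Python (the function names and the example distributions here are mine, chosen for illustration):

```python
import math

def entropy(p):
    # Shannon entropy: average bits per event under the code optimal for p.
    # Terms with p_i = 0 contribute nothing, so they are skipped.
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Average bits per event when events drawn from p are encoded
    # with a code that is optimal for q instead.
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(entropy(p))           # 1.5 bits
print(cross_entropy(p, q))  # 1.75 bits, >= entropy(p)
```

Note that `cross_entropy(p, p)` equals `entropy(p)`: using the right code costs nothing extra, and any mismatched code only adds bits.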

As an example, consider an alphabet of four letters (A, B, C, D), with A and B having the same frequency and C and D not appearing at all. So the probability distribution is $P = \left(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0\right)$.

Then if we want to encode it optimally, we encode A as 0 and B as 1, so we get one bit of encoded message per letter. (And that is exactly the Shannon entropy of our probability distribution.)

But if we have the same probability distribution $P$, yet we encode it according to a distribution where all letters are equally probable, $Q = \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right)$, then we get two bits per letter (for example, we encode A as 00, B as 01, C as 10 and D as 11).
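The worked example can be checked numerically; this is a minimal sketch, with `cross_entropy` being my own helper, not something from the answer:

```python
import math

def cross_entropy(p, q):
    # Average bits per letter when letters drawn from p are encoded
    # with a code optimal for q; zero-probability letters are skipped.
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5, 0.0, 0.0]       # only A and B actually occur
Q = [0.25, 0.25, 0.25, 0.25]   # code built as if all four were equally likely

print(cross_entropy(P, P))  # 1.0 bit per letter: the optimal code for P
print(cross_entropy(P, Q))  # 2.0 bits per letter: the mismatched code
```

The gap of one bit per letter is exactly the overhead paid for using the wrong distribution (the Kullback-Leibler divergence between P and Q).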

**Attribution**

*Source: Link, Question Author: Frames Catherine White, Answer Author: Piotr Migdal*