# Definition and origin of “cross entropy”

Without citing sources, Wikipedia defines the cross-entropy of discrete distributions $P$ and $Q$ to be

\begin{align}
\mathrm{H}^{\times}(P; Q)
&= -\sum_x p(x)\, \log q(x).
\end{align}

Who first used this quantity, and who coined the term? I looked in:

J. E. Shore and R. W. Johnson, “Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy,” IEEE Transactions on Information Theory, vol. 26, no. 1, pp. 26-37, Jan. 1980.

I followed their introduction to

A. Wehrl, “General properties of entropy,” Reviews of Modern Physics, vol. 50, no. 2, pp. 221-260, Apr. 1978.

who never uses the term.

Neither does

S. Kullback and R. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951.

I looked in

T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.

and

I. Good, “Maximum Entropy for Hypothesis Formulation, Especially for Multidimensional Contingency Tables,” The Annals of Mathematical Statistics, vol. 34, no. 3, pp. 911-934, 1963.

but both define cross-entropy to be synonymous with KL-divergence.

The original paper

C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, vol. 27, 1948.

doesn’t mention cross entropy (and has a strange definition of “relative entropy”: “The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols”).

Finally, I looked in some old books and papers by Tribus.

Does anyone know what the equation above is called, and who invented it or has a nice presentation of it?

It seems to be closely related to the concept of Kullback–Leibler divergence (see Kullback and Leibler, 1951). In their article, Kullback and Leibler discuss the mean information for discriminating between two hypotheses (defined as $I_{1:2}(E)$ in eqs. 2.2–2.4) and cite pp. 18-19 of Shannon and Weaver’s The Mathematical Theory of Communication (1949) and p. 76 of Wiener’s Cybernetics (1948).
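
To make the relationship concrete, here is a short derivation sketch. The symbols $\mathrm{H}(P)$ for the Shannon entropy of $P$ and $D_{\mathrm{KL}}(P \,\|\, Q)$ for the KL divergence are notation assumed here, not taken from the papers above, and the sums are assumed to run over $x$ with $p(x) > 0$ and $q(x) > 0$. Adding and subtracting $\sum_x p(x) \log p(x)$ in the definition from the question gives

\begin{align}
\mathrm{H}^{\times}(P; Q)
&= -\sum_x p(x)\, \log q(x) \\
&= -\sum_x p(x)\, \log p(x) + \sum_x p(x)\, \log \frac{p(x)}{q(x)} \\
&= \mathrm{H}(P) + D_{\mathrm{KL}}(P \,\|\, Q).
\end{align}

So for a fixed $P$ the two quantities differ only by the constant $\mathrm{H}(P)$, and minimizing either one over $Q$ selects the same $Q$, which may be why some authors use the two names interchangeably.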