There are a number of tools that provide a distance between two continuous probability distributions. Most (semi)distances, like the Kullback-Leibler divergence, work with probability density functions. However, the literature is quite sparse when it comes to comparing two distributions in Fourier space, i.e. via their characteristic functions. Is there an elegant way to do so?

**Answer**

One notable distance that can be considered in Fourier space is the maximum mean discrepancy (MMD). One first selects a positive semi-definite kernel k : \mathcal X \times \mathcal X \to \mathbb R corresponding to a reproducing kernel Hilbert space (RKHS) \mathcal H_k. The MMD is then

\operatorname{MMD}_k(\mathbb P, \mathbb Q) = \sup_{\lVert f \rVert_{\mathcal H_k} \le 1} \mathbb E_{X \sim \mathbb P} f(X) - \mathbb E_{Y \sim \mathbb Q} f(Y) = \left\lVert \mu_{\mathbb P} - \mu_{\mathbb Q} \right\rVert_{\mathcal H_k} = \sqrt{\mathbb E\, k(X, X') + \mathbb E\, k(Y, Y') - 2\, \mathbb E\, k(X, Y)},

where \mu_{\mathbb P} = \mathbb E_{X \sim \mathbb P}\, k(X, \cdot) is the kernel mean embedding of \mathbb P, and X, X' \sim \mathbb P, Y, Y' \sim \mathbb Q are independent.

You might be familiar with the energy distance; it is a special case of the MMD for a particular choice of kernel, namely k(x, y) = \tfrac12 \left( \lVert x \rVert + \lVert y \rVert - \lVert x - y \rVert \right).
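That connection can be checked numerically: with the biased (V-statistic) kernel expectation form of \operatorname{MMD}^2 and the distance-induced kernel k(x, y) = (\lVert x \rVert + \lVert y \rVert - \lVert x - y \rVert)/2, the squared energy distance equals exactly 2\,\operatorname{MMD}^2. A minimal NumPy sketch (function names are mine):

```python
import numpy as np

def mean_pairwise_dist(a, b):
    """Average Euclidean distance over all pairs (a_i, b_j)."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)).mean()

def energy_distance_sq(x, y):
    """Squared energy distance: 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||."""
    return (2 * mean_pairwise_dist(x, y)
            - mean_pairwise_dist(x, x)
            - mean_pairwise_dist(y, y))

def mmd_sq_distance_kernel(x, y):
    """Biased (V-statistic) MMD^2, E k(X,X') + E k(Y,Y') - 2 E k(X,Y),
    with the distance-induced kernel k(a, b) = (||a|| + ||b|| - ||a - b||)/2."""
    def k(a, b):
        na = np.sqrt((a ** 2).sum(-1))[:, None]
        nb = np.sqrt((b ** 2).sum(-1))[None, :]
        dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
        return (na + nb - dist) / 2
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```

Both quantities are plain averages over pairs, so the identity holds exactly (up to floating-point error) on any two samples, not just in expectation.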

This is a proper metric for many choices of kernel, called *characteristic* kernels; for any positive semi-definite kernel it is at least a semimetric.
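In practice the MMD is estimated from samples via the kernel expectation form \mathbb E\, k(X, X') + \mathbb E\, k(Y, Y') - 2\, \mathbb E\, k(X, Y). A minimal sketch with a Gaussian kernel (an unbiased U-statistic version; the function name and default bandwidth are illustrative):

```python
import numpy as np

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased (U-statistic) estimate of MMD^2 with a Gaussian kernel.

    Uses the kernel expectation form
        MMD^2 = E k(X, X') + E k(Y, Y') - 2 E k(X, Y),
    dropping the diagonal terms of the within-sample kernel matrices.
    x: (n, d) array, y: (m, d) array.
    """
    def gauss(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2 * sigma ** 2))

    kxx, kyy, kxy = gauss(x, x), gauss(y, y), gauss(x, y)
    n, m = len(x), len(y)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()
```

For samples from the same distribution the estimate hovers around zero (it can dip slightly negative, being unbiased) and grows as the distributions separate.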

What does this have to do with the Fourier transform? Well, if \mathcal X = \mathbb R^d and k(x, y) = \psi(x - y), so that \psi : \mathbb R^d \to \mathbb R is a positive-definite function, then it turns out the MMD can *also* be written as

\operatorname{MMD}_k(\mathbb P, \mathbb Q) = \sqrt{\int \left\lvert \varphi_{\mathbb P}(\omega) - \varphi_{\mathbb Q}(\omega) \right\rvert^2 \, \mathrm{d}\hat\psi(\omega)},

where \varphi denotes the characteristic function, and \hat\psi is the Fourier transform of \psi in the measure sense. (It will always be a finite nonnegative measure; you can see from this definition that a translation-invariant kernel is characteristic iff its Fourier transform is everywhere positive.)

For a proof, see Corollary 4 of Sriperumbudur et al., *Hilbert space embeddings and metrics on probability measures*, JMLR 2010.

The MMD, which you can easily estimate via the kernel expectation form above, thus compares distributions by the L_2 distance between their full characteristic functions, with frequencies weighted according to the choice of kernel. For example, the common Gaussian kernel k(x, y) = \exp\left( -\frac{1}{2\sigma^2} \lVert x - y \rVert^2 \right) weights the frequencies with a Gaussian with mean 0 and variance 1/\sigma^2.
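The weighted-L_2 view can also be checked directly by Monte Carlo: sample frequencies from the Gaussian kernel's spectral measure, \mathcal N(0, \sigma^{-2} I), and average the squared difference of the empirical characteristic functions. A sketch (function name and sample counts are illustrative; the estimate carries a small positive bias from the 1/n noise floor of the empirical CFs):

```python
import numpy as np

def mmd2_fourier(x, y, sigma=1.0, num_freqs=2000, seed=0):
    """Monte Carlo estimate of MMD^2 in its Fourier form.

    Draws frequencies from the spectral measure of the Gaussian kernel
    exp(-||x - y||^2 / (2 sigma^2)), which is N(0, sigma^{-2} I), and
    averages |phi_P(w) - phi_Q(w)|^2 over those frequencies, using the
    empirical characteristic functions of the two samples.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    w = rng.normal(0, 1 / sigma, (num_freqs, d))  # frequencies ~ N(0, 1/sigma^2)
    # empirical characteristic functions evaluated at each frequency
    phi_x = np.exp(1j * x @ w.T).mean(axis=0)     # shape (num_freqs,)
    phi_y = np.exp(1j * y @ w.T).mean(axis=0)
    return np.mean(np.abs(phi_x - phi_y) ** 2)
```

For large samples and many frequencies this agrees with the kernel-based estimate, since both target the same quantity.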

Sometimes it’s computationally faster, or more informative, to instead compare the characteristic functions at particular locations rather than everywhere. It turns out it’s better to make a slight tweak and evaluate differences in *smoothed* characteristic functions at random locations:

Chwialkowski et al., *Fast Two-Sample Testing with Analytic Representations of Probability Measures*, NeurIPS 2015.
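As a rough sketch of that idea (not the paper's exact procedure), one can evaluate empirical smoothed characteristic functions at a few random frequencies and normalize the per-sample differences with a Hotelling-type statistic. Smoothing the CF with a Gaussian in frequency space corresponds to a Gaussian weight on the observations in sample space; the unit smoothing bandwidth, the frequency distribution, and all names below are illustrative assumptions:

```python
import numpy as np

def scf_statistic(x, y, num_freqs=5, seed=0):
    """Hotelling-type statistic on smoothed-CF differences at random frequencies.

    Assumes paired samples of equal size. For each observation u we form
    features w(u) * [cos(t_j . u), sin(t_j . u)] with Gaussian weight
    w(u) = exp(-||u||^2 / 2); this is the sample-space counterpart of
    evaluating a Gaussian-smoothed empirical CF at frequencies t_j.
    """
    rng = np.random.default_rng(seed)
    t = rng.normal(size=(num_freqs, x.shape[1]))  # random test frequencies

    def feats(u):
        w = np.exp(-0.5 * (u ** 2).sum(axis=1, keepdims=True))
        a = u @ t.T                               # shape (n, num_freqs)
        return w * np.hstack([np.cos(a), np.sin(a)])

    z = feats(x) - feats(y)                       # paired differences, (n, 2J)
    zbar = z.mean(axis=0)
    cov = np.cov(z, rowvar=False)
    # n * zbar^T cov^{-1} zbar; roughly chi^2 with 2J dof under the null
    return len(z) * zbar @ np.linalg.solve(cov, zbar)
```

The statistic stays near its null chi-squared scale for matching distributions and blows up when the smoothed CFs differ at the chosen frequencies.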

A follow-up work finds the most informative frequencies to test, rather than random:

Jitkrittum et al., *Interpretable Distribution Features with Maximum Testing Power*, NeurIPS 2016.

These are all closely connected to classical tests based on empirical characteristic functions, as mentioned by kjetil in the comments.

**Attribution**
*Source: Link, Question Author: yannick, Answer Author: Danica*