# Density estimation for large dataset

I have a unidimensional data set with more than 1000000 observations.

Assuming that those observations are independent realizations of the same random variable I need to estimate the underling density function.

This estimated density function will be evaluated a large number of times. For this reason I’m concerned with the computational complexity of evaluation.

What is the best method for this case? Under what assumptions?

There are some tricks you can play with kernel density estimators (KDEs) to improve runtime. For example, you can use a kernel with compact support (e.g. an Epanechnikov or triweight kernel). To obtain perfect accuracy using a kernel with infinite support (e.g. a Gaussian), you’d have to sum over all kernels every time you want to evaluate the KDE. However, with compact support, you’d only need to sum over kernels whose support the evaluation point falls within (because the kernels have value 0 outside this range). Say the data points are $\{x_1, ..., x_n\}$ and you want to evaluate the KDE at point $x'$. Let $r$ be the ‘radius’ of the kernel function (half the width of the support). You can search for data points that fall within distance $r$ of $x'$, and evaluate those kernels only. The search can be performed efficiently (requiring sublinear time). In general, people use tree data structures (e.g. kd trees, ball trees) to accelerate the search. Since your data are 1d, you might be able to get away with something simpler (like storing the data in a sorted list and using binary search). For kernels with infinite support, it’s possible to employ a similar strategy (evaluating only nearby kernels), at the cost of a decrease in accuracy.

The following paper describes the use of tree data structures for accelerating KDEs. It also describes a ‘dual tree’ strategy, where a second tree is used to partition the points at which the KDE is to be evaluated. This allows processing them in chunks, with further efficiency gains.

Gray and Moore (2003). Nonparametric Density Estimation: Toward Computational Tractability

For example implementations, see scikit-learn (python) and the kernel density estimation toolbox (Matlab).

Another strategy for accelerating KDEs is binning/gridding. In this approach, the data are approximated using a set of bins, which receive weight according to the proximity of data points. The KDE is also evaluated at the bins (evaluation points will no longer be exact, which may or may not work for your application). This strategy reduces the number of kernel evaluations at the cost of accuracy. Further performance gains can be had by using the fast Fourier transform (FFT) to compute the binned KDE.

Hall and Wand (1994). On the Accuracy of Binned Kernel Density Estimators

Silverman (1982). Kernel Density Estimation Using the Fast Fourier Transform

Gray and Moore (2003). [Above]

For example implementation using FFTs, see StatsModels (python).

This blog post discusses tree-based and FFT-based KDEs, and gives some benchmark results.

If you care about evaluation time but not training time, you could use a mixture model instead of a KDE (KDEs being a special case of mixture models, with a mixture component centered over each data point, all sharing the same width). Using fewer mixture components will speed up evaluation time. It may be possible to combine the first strategy (compact support + efficient search) with mixture models.

Another option would be to use interpolation. Fit whatever density estimator you like and evaluate it at a fixed number of points $y = \{y_1, ..., y_m\}$, giving corresponding density values $p = \{p_1, ..., p_m\}$. To evaluate the density at a new point $x'$, perform interpolation using $y$ and $p$. This could be simple linear interpolation, or something fancier like splines. Decreasing $m$ will give a less faithful representation of the density estimate, but faster evaluation times. You could even choose $y$ adaptively, since fewer points will be required in regions where the function is smoother. Interpolation will require finding points in $y$ that are near to $x'$. As before, efficient search procedures can be used.