# Area under the “pdf” in kernel density estimation in R

I am trying to use the ‘density‘ function in R to do kernel density estimates. I am having some difficulty interpreting the results and comparing various datasets as it seems the area under the curve is not necessarily 1. For any probability density function (pdf) $\phi(x)$, we need to have the area $\int_{-\infty}^\infty \phi(x) dx = 1$. I am assuming that the kernel density estimate reports the pdf. I am using integrate.xy from sfsmisc to estimate the area under the curve.

> # generate some data
> xx<-rnorm(10000)
> # get density
> xy <- density(xx)
> # plot it
> plot(xy)


> # load the library
> library(sfsmisc)
> integrate.xy(xy$x,xy$y)
[1] 1.000978
> # fair enough, area close to 1
> # use another bw
> xy <- density(xx,bw=.001)
> plot(xy)


> integrate.xy(xy$x,xy$y)
[1] 6.518703
> xy <- density(xx,bw=1)
> integrate.xy(xy$x,xy$y)
[1] 1.000977
> plot(xy)


> xy <- density(xx,bw=1e-6)
> integrate.xy(xy$x,xy$y)
[1] 6507.451
> plot(xy)


Shouldn’t the area under the curve always be 1? It seems small bandwidths are a problem, but sometimes you want to show the details etc. in the tails and small bandwidths are needed.

It seems that the answer below about the overestimation in convex regions is correct as increasing the number of integration points seems to lessen the problem (I didn’t try to use more than $2^{20}$ points.)

> xy <- density(xx,n=2^15,bw=.001)
> plot(xy)


> integrate.xy(xy$x,xy$y)
[1] 1.000015
> xy <- density(xx,n=2^20,bw=1e-6)
> integrate.xy(xy$x,xy$y)
[1] 2.812398


Think about the trapezoid rule integrate.xy() uses. For the normal distribution, it will underestimate the area under the curve in the interval (-1,1) where the density is concave (and hence the linear interpolation is below the true density), and overestimate it elsewhere (as the linear interpolation goes on top of the true density). Since the latter region is larger (in Lesbegue measure, if you like), the trapezoid rule tends to overestimate the integral. Now, as you move to smaller bandwidths, pretty much all of your estimate is piecewise convex, with a lot of narrow spikes corresponding to the data points, and valleys between them. That’s where the trapezoid rule breaks down especially badly.