# Estimate of parameter of exponential distribution with binned data

I have the following data, which can be modeled by an exponential distribution:

| Time      | 0-20 | 20-40 | 40-60 | 60-90 | 90-120 | 120-inf |
|-----------|------|-------|-------|-------|--------|---------|
| Frequency | 41   | 19    | 16    | 13    | 9      | 2       |


To test whether the data follow an exponential distribution, I will use the chi-squared test statistic. But for this I also need to estimate lambda (the MLE is $\hat{\lambda} = \frac{1}{\bar X}$).

So my question is: how should we choose the midpoint of the interval, if the last interval is from 120 to infinity?

I would not use the midpoint for any of those intervals (except perhaps as an initial guess for some iterative procedure).

If the data really were from an exponential distribution, the values within each bin would be right-skewed, so the mean within a bin would be expected to lie to the left of the average of the bin boundaries.

Note that the equation $\hat{\lambda}=\frac{1}{\bar{X}}$ is suitable if you have all the data. With binned data you need to maximize the likelihood for a binned (i.e. interval-censored) exponential.

[The contribution to the log-likelihood of the $n_i$ observations in bin $i$ (those between $l_i$ and $u_i$) is $n_i \log(F(u_i)-F(l_i))$, where the two terms in $F$ are functions of the parameter(s) of the distribution. For the exponential, $F(x)=1-e^{-\lambda x}$, and $F(u_i)=1$ for the open-ended top bin.]
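As a rough illustration of maximizing the interval-censored likelihood for the counts in the question, the sketch below evaluates the binned log-likelihood and maximizes it over $\lambda$ with a simple golden-section search. The function names (`survival`, `binned_loglik`, `golden_max`) and the search bracket for $\lambda$ are assumptions made for this example, not part of the original answer; any one-dimensional optimizer would do.

```python
import math

# Bin edges and counts from the question; the last bin is open-ended.
edges = [0.0, 20.0, 40.0, 60.0, 90.0, 120.0, math.inf]
counts = [41, 19, 16, 13, 9, 2]

def survival(x, lam):
    """S(x) = 1 - F(x) = exp(-lam * x) for an exponential with rate lam."""
    return 0.0 if math.isinf(x) else math.exp(-lam * x)

def binned_loglik(lam):
    """Sum over bins of n_i * log(F(u_i) - F(l_i)), written via S = 1 - F."""
    total = 0.0
    for n, lo, hi in zip(counts, edges[:-1], edges[1:]):
        total += n * math.log(survival(lo, lam) - survival(hi, lam))
    return total

def golden_max(f, a, b, tol=1e-10):
    """Golden-section search for the maximum of a unimodal f on [a, b]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            # Maximum lies in [a, d]
            b, d = d, c
            c = b - r * (b - a)
        else:
            # Maximum lies in [c, b]
            a, c = c, d
            d = a + r * (b - a)
    return (a + b) / 2.0

# Assumed bracket: rates between 1e-4 and 1 (means between 1 and 10000).
lam_hat = golden_max(binned_loglik, 1e-4, 1.0)
mean_hat = 1.0 / lam_hat
print(f"lambda_hat = {lam_hat:.5f}, mean_hat = {mean_hat:.2f}")
```

The binned exponential log-likelihood is concave in $\lambda$, which is why a bracketing search is enough here. For these counts the estimated mean comes out a little under 40.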

Because of the lack of memory property of the exponential, if you have a good approximation for the mean of the exponential you also have a good approximation of the amount by which the mean of the distribution above some value $x_0$ exceeds $x_0$.

So (assuming you don’t directly maximize the likelihood* on the interval censored data as I suggested), you could begin with some approximate estimate of the mean ($m^{(0)}$ say) and use $120+m^{(0)}$ as a “centre” of the upper tail.

This might then be used to get a better estimate of the parameter (and hence of the mean) and so obtain an improved estimate of the conditional mean in each bin including the top one. [If you want such an approach I would perhaps lean toward doing EM directly.]
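The iteration described above can be sketched as a small EM loop: given a current mean $m$, replace each bin by its conditional mean, then re-estimate $m$ as the count-weighted average of those "centres". The names `conditional_mean` and `em_mean`, the starting value, and the tolerance are assumptions chosen for this illustration. The finite-bin formula used is $E[X \mid l < X < u] = l + m - d/(e^{d/m}-1)$ with $d = u - l$.

```python
import math

# Bin edges and counts from the question; the last bin is open-ended.
edges = [0.0, 20.0, 40.0, 60.0, 90.0, 120.0, math.inf]
counts = [41, 19, 16, 13, 9, 2]

def conditional_mean(lo, hi, m):
    """E[X | lo < X < hi] for an exponential with mean m."""
    if math.isinf(hi):
        # Memoryless property: the excess over lo again has mean m.
        return lo + m
    d = hi - lo
    return lo + m - d / math.expm1(d / m)

def em_mean(m0, tol=1e-10, max_iter=1000):
    """Iterate E-step (bin centres) and M-step (weighted mean) until stable."""
    m = m0
    total = sum(counts)
    for _ in range(max_iter):
        # E-step: conditional mean of each bin under the current m
        centres = [conditional_mean(lo, hi, m)
                   for lo, hi in zip(edges[:-1], edges[1:])]
        # M-step: the mean of the "completed" data
        m_new = sum(n * c for n, c in zip(counts, centres)) / total
        if abs(m_new - m) < tol:
            return m_new, centres
        m = m_new
    return m, centres

# Assumed starting value, taken from the rough range suggested in the text.
m_hat, centres = em_mean(m0=38.0)
print(f"mean_hat = {m_hat:.2f}")
print("bin centres:", [round(c, 2) for c in centres])
```

At convergence the centre of every finite bin sits below its midpoint, and the centre of the top bin is 120 plus the estimated mean.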

Several simple estimates of the mean can be obtained quickly. For example, since 41% of the values (41 of 100) occur below 20, setting $\exp(-20/m^{(0)})=1-0.41$ gives an estimate of the mean close to $38$. Alternatively, one can get a quick eyeball estimate of the median (something less than 30, perhaps about 28), so the mean should be somewhere near $28/\log(2)$, or around $40$.
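For anyone who wants to check the arithmetic, here is a minimal sketch of the two quick estimates. The variable names are made up for this example, and the median of 28 is the eyeballed value from the text, not something computed from the data.

```python
import math

counts = [41, 19, 16, 13, 9, 2]
n = sum(counts)  # 100 observations in total

# Estimate 1: match the observed proportion below 20.
# Solve exp(-20 / m) = 1 - 0.41 for the mean m.
p_below_20 = counts[0] / n
mean_from_first_bin = -20.0 / math.log(1.0 - p_below_20)

# Estimate 2: eyeballed median of about 28 (an assumption from the text);
# for an exponential, median = mean * log(2).
median_guess = 28.0
mean_from_median = median_guess / math.log(2.0)

# Either value can serve as the initial offset above 120 for the top bin.
upper_centre_guess = 120.0 + mean_from_first_bin

print(f"mean from first bin:  {mean_from_first_bin:.1f}")
print(f"mean from median:     {mean_from_median:.1f}")
print(f"initial top-bin centre: {upper_centre_guess:.1f}")
```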

Either of these would be reasonable to use as an initial guess at how far above 120 to place an estimate for the conditional mean for the last bin.

* An alternative to maximizing the likelihood would be to minimize the chi-square statistic; the same adjustment to d.f. would be used in that instance. The chi-square statistic is relatively easy to calculate, and pretty simple to optimize for a single parameter.
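A minimal sketch of the minimum chi-square idea, assuming a plain grid scan over candidate means between 20 and 80 in steps of 0.01 (the range, the step, and the names `bin_probs` and `chi_square` are choices made for this illustration):

```python
import math

# Bin edges and counts from the question; the last bin is open-ended.
edges = [0.0, 20.0, 40.0, 60.0, 90.0, 120.0, math.inf]
counts = [41, 19, 16, 13, 9, 2]
n = sum(counts)

def bin_probs(m):
    """Probability of each bin under an exponential with mean m."""
    surv = [0.0 if math.isinf(x) else math.exp(-x / m) for x in edges]
    return [s_lo - s_hi for s_lo, s_hi in zip(surv[:-1], surv[1:])]

def chi_square(m):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(counts, bin_probs(m)))

# Assumed search grid for the mean: 20.00, 20.01, ..., 80.00
grid = [20.0 + 0.01 * k for k in range(6001)]
m_min = min(grid, key=chi_square)

print(f"minimum chi-square mean = {m_min:.2f}")
print(f"chi-square at minimum   = {chi_square(m_min):.3f}")
```

With six bins and one estimated parameter, the usual reference distribution for the resulting statistic is chi-square with $6-1-1=4$ degrees of freedom.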