Online estimation of quartiles without storing observations

I need to compute quartiles (Q1, the median, and Q3) in real time on a large data set without storing the observations. I first tried the P² algorithm (Jain/Chlamtac), but I was not satisfied with it (a bit too much CPU use, and I was not convinced by its precision, at least on my data set).

I now use the FAME algorithm (Feldman/Shavitt) to estimate the median on the fly, and I have tried to adapt it to also compute Q1 and Q3:

M = Q1 = Q3 = first_value          # initialise all three estimates with the first data value
step = step_Q1 = step_Q3 = small   # "a small value", e.g. a fraction of the data scale

for data in stream:                # for each new observation
    # update the median estimate M
    if M > data:
        M = M - step
    elif M < data:
        M = M + step
    if abs(data - M) < step:
        step = step / 2

    # estimate Q1 using the observations below M
    if data < M:
        if Q1 > data:
            Q1 = Q1 - step_Q1
        elif Q1 < data:
            Q1 = Q1 + step_Q1
        if abs(data - Q1) < step_Q1:
            step_Q1 = step_Q1 / 2
    # estimate Q3 using the observations above M
    elif data > M:
        if Q3 > data:
            Q3 = Q3 - step_Q3
        elif Q3 < data:
            Q3 = Q3 + step_Q3
        if abs(data - Q3) < step_Q3:
            step_Q3 = step_Q3 / 2

To summarize, it simply uses the median M obtained on the fly to divide the data set in two, and then reuses the same algorithm for both Q1 and Q3.
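
For reference, here is a minimal self-contained sketch of the update rule above, packaged as a class and checked against the exact sample quartiles on simulated data (the class name, the initial step of 0.1 and the NumPy comparison are my additions, not part of the original algorithm description):

import numpy as np

class StreamingQuartiles:
    """Running estimates of Q1, the median M and Q3, using the update rule described above."""

    def __init__(self, first_value, step=0.1):
        self.M = self.Q1 = self.Q3 = first_value
        self.step = self.step_Q1 = self.step_Q3 = step   # the "small value"

    def update(self, data):
        # update the running median M
        if self.M > data:
            self.M -= self.step
        elif self.M < data:
            self.M += self.step
        if abs(data - self.M) < self.step:
            self.step /= 2
        # update Q1 on the lower half and Q3 on the upper half
        if data < self.M:
            if self.Q1 > data:
                self.Q1 -= self.step_Q1
            elif self.Q1 < data:
                self.Q1 += self.step_Q1
            if abs(data - self.Q1) < self.step_Q1:
                self.step_Q1 /= 2
        elif data > self.M:
            if self.Q3 > data:
                self.Q3 -= self.step_Q3
            elif self.Q3 < data:
                self.Q3 += self.step_Q3
            if abs(data - self.Q3) < self.step_Q3:
                self.step_Q3 /= 2

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=3.0, size=100_000)
est = StreamingQuartiles(x[0])
for v in x[1:]:
    est.update(v)
print(est.Q1, est.M, est.Q3)              # streaming estimates
print(np.percentile(x, [25, 50, 75]))     # exact quartiles, for comparison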

This appears to work, more or less, but I am not able to prove it (I am not a mathematician). Is it flawed?
I would appreciate any suggestion, or another technique that fits the problem.

Thank you very much for your help!

==== EDIT =====

For those who are interested in such questions: after a few weeks, I finally ended up simply using reservoir sampling with a reservoir of 100 values, and it gave very satisfying results (to me).
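
For completeness, this is roughly what that looks like: a small sketch of reservoir sampling (Algorithm R) with the quartiles read off the 100-value reservoir. The class and method names, the seeding and the NumPy percentile call are my own choices, not part of the original post:

import random
import numpy as np

class ReservoirQuartiles:
    """Keep a uniform random sample of the stream (Algorithm R) and read quartiles off it."""

    def __init__(self, size=100, seed=None):
        self.size = size
        self.reservoir = []
        self.count = 0
        self.rng = random.Random(seed)

    def update(self, data):
        self.count += 1
        if len(self.reservoir) < self.size:
            self.reservoir.append(data)
        else:
            # each incoming value replaces a random slot with probability size / count
            j = self.rng.randrange(self.count)
            if j < self.size:
                self.reservoir[j] = data

    def quartiles(self):
        return np.percentile(self.reservoir, [25, 50, 75])

est = ReservoirQuartiles(size=100, seed=0)
for v in np.random.default_rng(2).exponential(size=100_000):
    est.update(v)
print(est.quartiles())    # approximate Q1, median and Q3 of the stream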

Answer

The median is the point at which half the observations fall below and half above. Similarly, the 25th percentile is the median of the data between the minimum and the median, and the 75th percentile is the median of the data between the median and the maximum. So yes, I think you’re on solid ground applying whatever median algorithm you use first on the entire data set to partition it, and then on the two resulting pieces.
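
To see this concretely, here is a quick offline check on a static sample (the synthetic data and the NumPy calls are illustrative additions, not from the original answer): the median of the values below the overall median lands close to the exact 25th percentile, and the median of the values above it lands close to the exact 75th percentile.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_001)                 # synthetic data

m = np.median(x)
q1_as_median = np.median(x[x < m])           # median of the lower half
q3_as_median = np.median(x[x > m])           # median of the upper half

print(q1_as_median, np.percentile(x, 25))    # both close to -0.674 for standard normal data
print(q3_as_median, np.percentile(x, 75))    # both close to +0.674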

Update:

This question on Stack Overflow leads to this paper: Raj Jain, Imrich Chlamtac: The P² Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations. Commun. ACM 28(10): 1076-1085 (1985), whose abstract indicates it’s probably of great interest to you:

A heuristic algorithm is proposed for dynamic calculation of the
median and other quantiles. The estimates are produced dynamically as
the observations are generated. The observations are not stored;
therefore, the algorithm has a very small and fixed storage
requirement regardless of the number of observations. This makes it
ideal for implementing in a quantile chip that can be used in
industrial controllers and recorders. The algorithm is further
extended to histogram plotting. The accuracy of the algorithm is
analyzed.
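
For reference, here is a sketch of the P² update for a single quantile p, written from my reading of the paper (five markers, parabolic adjustment with a linear fallback). The class name and the handling of the first five observations are my own choices, so treat it as illustrative rather than as a reference implementation:

import random

class P2Quantile:
    """Single-quantile P² estimator (a sketch based on Jain & Chlamtac, 1985)."""

    def __init__(self, p):
        self.p = p
        self.buf = []        # first five observations, used to initialise the markers
        self.q = None        # marker heights
        self.n = None        # actual marker positions
        self.np_ = None      # desired marker positions
        self.dnp = [0.0, p / 2.0, p, (1.0 + p) / 2.0, 1.0]   # desired-position increments

    def update(self, x):
        if self.q is None:
            self.buf.append(x)
            if len(self.buf) == 5:                          # initialise the five markers
                self.q = sorted(self.buf)
                self.n = [1.0, 2.0, 3.0, 4.0, 5.0]
                p = self.p
                self.np_ = [1.0, 1.0 + 2.0 * p, 1.0 + 4.0 * p, 3.0 + 2.0 * p, 5.0]
            return
        q, n, np_ = self.q, self.n, self.np_
        # 1. find the cell k containing x, adjusting the extreme markers if needed
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = 0
            while x >= q[k + 1]:
                k += 1
        # 2. shift the positions of the markers above the cell; update all desired positions
        for i in range(k + 1, 5):
            n[i] += 1.0
        for i in range(5):
            np_[i] += self.dnp[i]
        # 3. move the three inner markers towards their desired positions
        for i in (1, 2, 3):
            d = np_[i] - n[i]
            if (d >= 1.0 and n[i + 1] - n[i] > 1.0) or (d <= -1.0 and n[i - 1] - n[i] < -1.0):
                d = 1.0 if d > 0.0 else -1.0
                # parabolic (P²) prediction of the new marker height
                qp = q[i] + d / (n[i + 1] - n[i - 1]) * (
                    (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
                    + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
                )
                if q[i - 1] < qp < q[i + 1]:
                    q[i] = qp
                else:                                       # fall back to linear interpolation
                    j = i + int(d)
                    q[i] = q[i] + d * (q[j] - q[i]) / (n[j] - n[i])
                n[i] += d

    def value(self):
        if self.q is None:                                  # fewer than five observations so far
            return sorted(self.buf)[len(self.buf) // 2] if self.buf else None
        return self.q[2]                                    # the middle marker estimates the p-quantile

random.seed(0)
est = P2Quantile(0.25)
for _ in range(100_000):
    est.update(random.gauss(0.0, 1.0))
print(est.value())        # should be close to -0.674, the first quartile of a standard normal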

Attribution
Source: Link, Question Author: Louis Hugues, Answer Author: Community
