I need to compute quartiles (Q1,median and Q3) in real-time on a large set of data without storing the observations. I first tried the P square algorithm (Jain/Chlamtac) but I was no satisfied with it (a bit too much cpu use and not convinced by the precision at least on my dataset).
I use now the FAME algorithm (Feldman/Shavitt) for estimating the median on the fly and try to derivate the algorithm to compute also Q1 and Q3 :
M = Q1 = Q3 = first data value step =step_Q1 = step_Q3 = a small value for each new data : # update median M if M > data: M = M - step elif M < data: M = M + step if abs(data-M) < step: step = step /2 # estimate Q1 using M if data < M: if Q1 > data: Q1 = Q1 - step_Q1 elif Q1 < data: Q1 = Q1 + step_Q1 if abs(data - Q1) < step_Q1: step_Q1 = step_Q1/2 # estimate Q3 using M elif data > M: if Q3 > data: Q3 = Q3 - step_Q3 elif Q3 < data: Q3 = Q3 + step_Q3 if abs(data-Q3) < step_Q3: step_Q3 = step_Q3 /2
To resume, it simply uses median M obtained on the fly to divide the data set in two and then reuse the same algorithm for both Q1 and Q3.
This appears to work somehow but I am not able to demonstrate (I am not a mathematician) . Is it flawned ?
I would appreciate any suggestion or eventual other technique fitting the problem.
Thank you very much for your Help !
==== EDIT =====
For those who are interested by such questions, after a few weeks, I finally ended by simply using Reservoir Sampling with a revervoir of 100 values and it gave very satistfying results (to me).
The median is the point at which 1/2 the observations fall below and 1/2 above. Similarly, the 25th perecentile is the median for data between the min and the median, and the 75th percentile is the median between the median and the max, so yes, I think you’re on solid ground applying whatever median algorithm you use first on the entire data set to partition it, and then on the two resulting pieces.
This question on stackoverflow leads to this paper: Raj Jain, Imrich Chlamtac: The P² Algorithm for Dynamic Calculation of Quantiiles and Histograms Without Storing Observations. Commun. ACM 28(10): 1076-1085 (1985) whose abstract indicates it’s probably of great interest to you:
A heuristic algorithm is proposed for dynamic calculation qf the
median and other quantiles. The estimates are produced dynamically as
the observations are generated. The observations are not stored;
therefore, the algorithm has a very small and fixed storage
requirement regardless of the number of observations. This makes it
ideal for implementing in a quantile chip that can be used in
industrial controllers and recorders. The algorithm is further
extended to histogram plotting. The accuracy of the algorithm is