Quantile Estimation

Estimating quantiles and comparing algorithms

For a recent project, I had to compute the deciles (i.e., the 0.1- to 0.9-quantiles) over sets of several million numbers for a variation of Welch's algorithm. It turns out that calculating quantiles is an interesting problem in statistical/numerical computing, and also a very important one. A quantile essentially provides a way of quantizing (hence the name) a distribution to an arbitrary degree, giving a form of lossy compression: just a few numbers can then tell us a lot about what the original distribution looked like. This in turn is useful for downstream tasks like classification, where the quantile values form a compact feature vector of the original data.

In numpy, the percentile() function implements the definition one learns in high school: sort all $N$ numbers and then pick the element at rank $pN$ as the estimate of the $p$-quantile. This approach of course has the drawback of needing all $N$ numbers in memory. While holding tens of millions of numbers (a few hundred MiB of doubles) is no problem for a modern computer, I wanted to explore getting the entire algorithm onto a microcontroller with only 2MiB of flash memory, so this wasn't going to cut it.
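For reference, a minimal version of that textbook approach might look like the following (np.percentile() itself is more careful and, by default, interpolates between neighbouring elements rather than picking just one):

import numpy as np

def naive_quantile(values, p):
    # Sort all N values (so they must all fit in memory) and pick
    # the element at rank p * N, clamped to a valid index.
    values = np.sort(np.asarray(values))
    idx = min(int(p * values.size), values.size - 1)
    return values[idx]

# The median of a million uniform samples should be close to 0.5:
print(naive_quantile(np.random.default_rng().random(1_000_000), 0.5))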

As it turns out, there is a small zoo of so-called "single-pass" or "streaming" quantile algorithms, which do their best to estimate the quantiles of a series of numbers while keeping only a much smaller buffer in memory, at the cost of additional potential error in any given quantile. The Wikipedia article on quantiles provides a good starting point on some of these.

As a side note: many of these algorithms are used in networking applications, where one wants to e.g. estimate the 99th percentile response time for a service, but can't keep around all of the requests that might arrive to that service over all time.

I ultimately settled on a simple implementation found in Numerical Recipes (3rd edition), where the elements in the incoming buffer (typically much smaller than the total number of elements) are treated as an estimate of the underlying CDF and are combined with the CDF that is slowly built up from each previous buffer. This implementation required only about 40 lines of code without any dependencies except for an appropriate choice of sorting function, which seemed ideal for embedded software. In this article, I'd like to turn to getting a sense for the error it introduces.
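To give a flavour of that idea, here is a deliberately simplified sketch (not the Numerical Recipes implementation, which is more careful, particularly about the tails): maintain value estimates at a fixed grid of quantile positions, and fold each sorted incoming buffer in as a weighted average of the two empirical quantile functions (the inverse CDFs).

import numpy as np

class MiniSketch:
    # Simplified illustration of the CDF-merging idea: the running
    # state is one value estimate per fixed quantile position.
    def __init__(self, n_points=129):
        self.p_grid = np.linspace(0.0, 1.0, n_points)
        self.values = None  # value estimates at the grid positions
        self.count = 0      # total elements seen so far

    def update(self, buf):
        buf = np.sort(np.asarray(buf))
        # Empirical quantile function of the incoming buffer,
        # evaluated on the fixed probability grid
        buf_p = (np.arange(buf.size) + 0.5) / buf.size
        new_vals = np.interp(self.p_grid, buf_p, buf)
        if self.values is None:
            self.values = new_vals
        else:
            # Weight the buffer against everything seen so far
            w = buf.size / (self.count + buf.size)
            self.values = (1 - w) * self.values + w * new_vals
        self.count += buf.size

    def quantile(self, p):
        return np.interp(p, self.p_grid, self.values)

Folding in a buffer costs one sort plus an interpolation, and the memory footprint stays fixed at n_points values no matter how many numbers stream through.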

Note that the term "estimate" is used even for the "static" algorithms that keep all of the numbers around, as in numpy's percentile() function. Calculating the quantiles of any sample from the true population is never going to give us precisely the true quantiles; that would require knowing the full population, which is rarely the case. Even in DSP, we sample real-world signals at a given sample rate, so we can only make guesses about the original frequencies, just the same as if we were conducting medical trials on test and control groups!

The best one can do when estimating quantiles even with the full, static sample is given by the standard deviation of a binomial distribution:

$\sqrt{Np(1-p)}$

(The intuition here: asking whether a sample value falls below the true $p$-quantile is a yes-no question that comes up true with probability $p$, so the number of the $N$ samples below that quantile follows a binomial distribution, and the rank of any estimate is uncertain by about its standard deviation.)
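A quick simulation bears this out: count how many of $N$ uniform samples fall below the true median across many trials, and the spread of that count matches $\sqrt{Np(1-p)}$.

import numpy as np

# For Uniform(0, 1), the true p-quantile is simply p, so we can count
# directly how many of the N samples land below it in each trial.
rng = np.random.default_rng()
N, p, trials = 100_000, 0.5, 500
counts = [(rng.random(N) < p).sum() for _ in range(trials)]
print(np.std(counts))            # empirically close to...
print(np.sqrt(N * p * (1 - p)))  # ...the theoretical value ~158.1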

As Press et al. tell us in Numerical Recipes, we can then measure the error by asking: which index $j$ has the value $A_j$ closest to the estimate of the $p$-quantile produced by our algorithm? The error is then given by:

$\lvert j - pN \rvert$

This essentially compares where our result lies within the sorted original sample to where it should have been.

If the two error values are comparable (i.e., their ratio is $\lesssim 1$), then we know that our single-pass algorithm couldn't have done any better than the static quantile algorithm given the sample we have chosen!
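As a concrete example: with $N = 10^6$ and $p = 0.5$, the inherent error is $\sqrt{10^6 \cdot 0.5 \cdot 0.5} = 500$ positions. If the value our algorithm reports for the median actually sits at index $j = 500300$ of the sorted sample, then $\lvert j - pN \rvert = 300$ and the ratio is $0.6$: the estimate is within the sampling noise of the data itself.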

Armed with this, it's time to see how well the algorithm performs. For the implementation, I've created a Python wrapper that simply takes in an array and treats it as the buffer, allowing us to test easily with distributions created in numpy. To try out different distributions, we'll first write a function that takes a generator yielding buffers sampled from the distribution specified by the caller:

import numpy as np
from qsketch import Sketch

# Things we'll need throughout the article:
deciles = (np.arange(9) + 1) / 10
buf_size = 2048
iter_count = 100
rng = np.random.default_rng()

def measure_sketch_error(gen):
    # Our single-pass "sketch" that keeps track of the quantiles so far
    s = Sketch() 
    
    # Keep track of all values we pass into the sketch so
    # we can also compute the static quantile
    all_values = []
    for arr in gen():
        all_values.append(arr)
        s.update(arr)
    
    all_values = np.sort(np.vstack(all_values).flatten())
    qs = np.array(s.get_quantiles())
    
    N = all_values.size
    # Use broadcasting to find, for each reported quantile value qs[k],
    # the index of the closest element in the sorted sample; argmin()
    # returns these indexes, i.e. the j for each p in |j - pN|
    js = np.abs(qs[:, None] - all_values[None, :]).argmin(axis=1)
    
    # Now calculate |j - pN|
    alg_err = np.abs(js - deciles * N)
    
    inherent_err = np.sqrt((N * deciles) * (1 - deciles))
    
    ratio = alg_err / inherent_err
    print('sketch error', ratio)

    # As a sanity check, we'll also print out the error ratio we 
    # get if we use the "static" np.quantile() function that has
    # access to all elements at once!
    qs = np.quantile(all_values, deciles)
    js = np.abs(qs[:, None] - all_values[None, :]).argmin(axis=1)
    alg_err = np.abs(js - deciles * N)
    ratio = alg_err / inherent_err
    print('np.quantile error', ratio)

Now let's take a look at how our algorithm performs with uniformly distributed data:

def uniform_generator():
    for i in range(iter_count):
        yield rng.random(buf_size) * 3

measure_sketch_error(uniform_generator)
sketch error [0.11048543 0.00552427 0.12054949 0.11727427 0.01767767 0.10825318
 0.04339782 0.12153398 0.03682848]
np.quantile error [0.         0.         0.         0.         0.         0.00451055
 0.00482198 0.00552427 0.0073657 ]

Not too bad. We see the ratio is well below 1 for all of our deciles. As we expect, np.quantile() has practically no error. Now we can try normal and exponential distributions:

def normal_generator():
    for i in range(iter_count):
        yield rng.normal(0, 1, buf_size)

def exp_generator():
    for i in range(iter_count):
        yield rng.exponential(4, buf_size)

print('normal:')
measure_sketch_error(normal_generator)

print('\nexponential:')
measure_sketch_error(exp_generator)
normal:
sketch error [0.21360517 0.12153398 0.05786376 0.01353165 0.07071068 0.11727427
 0.08679563 0.03314563 0.44194174]
np.quantile error [0.         0.         0.         0.         0.00441942 0.00451055
 0.00482198 0.00552427 0.0073657 ]

exponential:
sketch error [0.07365696 0.01657282 0.00482198 0.10374263 0.08838835 0.04059494
 0.13019345 0.22097087 0.42721035]
np.quantile error [0.         0.         0.         0.         0.00441942 0.00451055
 0.00482198 0.00552427 0.0073657 ]

We begin to see higher error around the extreme tails of the distributions when compared to the static algorithm, which is what we might expect, although the uncertainty in these estimates is still less than one standard deviation of the associated binomial distribution.
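As a further check on tail behaviour, the same harness can be pointed at an even heavier-tailed distribution, e.g. a lognormal with a large shape parameter, without any other changes:

def lognormal_generator():
    for i in range(iter_count):
        yield rng.lognormal(0, 2, buf_size)

print('lognormal:')
measure_sketch_error(lognormal_generator)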

Going any further into the distribution of the underlying real-world data requires us to first have a dataset to play with, and is beyond the scope of this article. Once more is known about this distribution, e.g. whether it is multi-modal or perhaps follows some more exotic form, this analysis would provide a useful starting point for assessing whether the Sketch algorithm could be applied. So far, it seems more than fitting even for heavy-tailed distributions, and there is still room to tune the implementation, say by increasing the number of initial "fixed" quantiles upon which the estimates are based, or by biasing them toward the tails.
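If the wrapper exposed its quantile grid, that tuning might look something like the following. (The grid argument here is hypothetical, purely for illustration; the actual Sketch constructor may offer different knobs.)

# Hypothetical tuning -- the real Sketch constructor may differ.
# More grid points overall, with extra resolution near the upper tail:
grid = np.unique(np.concatenate([
    np.linspace(0.0, 0.9, 200),   # coarse over the bulk
    np.linspace(0.9, 1.0, 100),   # fine in the upper tail
]))
s = Sketch(grid=grid)             # assumes a 'grid' keyword exists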

For future article ideas, I might try to:

  • Analyse the performance of the Sketch algorithm
  • Go into detail about how piecewise-linear interpolation works (the internals of the algorithm)