What Is the Distribution of Sample Quantiles?

This is a translation of my original article on habr.com.

Sample means, sample variances, sample quantiles, and other statistics are random variables by nature. Knowing their distributions helps us build more accurate confidence intervals, test statistical hypotheses, and satisfy simple curiosity.

Most statistics students remember that the sample mean is approximately normally distributed under certain conditions (the Central Limit Theorem). These conditions hold for a wide class of distributions, making the result extremely useful in practice.

Stronger students may also recall the distribution of the sample variance: if we observe (n) independent normally distributed random variables with identical parameters, then $ \frac{n\hat{\sigma}^2}{\sigma^2} $ follows a chi-square distribution with (n-1) degrees of freedom (Cochran's theorem).

This fact is less widely known but still useful when constructing confidence intervals.

Sample quantiles, however, are often overlooked, despite being widely used statistics. We'd also like to build accurate confidence intervals for them. In this note, I collected information about the distributions of sample quantiles and investigated when theoretical results work well in practice. I won't provide proofs, but I will include Python code so the experiments can be reproduced.

Theorem on the Distribution of Sample Quantiles

Some textbooks present the following theorem:

Let $ (X_1,\dots,X_n) $ be a sample from a distribution with density $ f(x) $. Let $ z_p $ be the true quantile of level $p$, and $z_{n,p}$ the corresponding sample quantile.

If $f'(x)$ is continuously differentiable in a neighborhood of $z_p$ and $f(z_p) > 0$, then

$ \sqrt{n}(z_p - z_{n,p}) \Rightarrow N\left(0,\frac{p(1-p)}{f(z_p)^2}\right) $

In other words, for a certain class of distributions, the sample quantile is approximately normally distributed around the true quantile with variance $ \frac{p(1-p)}{n,f(z_p)^2} $ .

We will use this theorem and examine how large (n) must be for the approximation to work well.

Distribution via Order Statistics

For a sample of size $n$, the quantile of level $p$ is approximately equal to the order statistic $X_{(k)}$, where $ k \approx np $.

If $ X_{(k)} \le x $, then at least $k$ sample elements are less than or equal to $x$. Therefore, its cumulative distribution function is:

$$F_{X_{(k)}}(x) = \sum_{i=k}^{n} \binom{n}{i} F(x)^i (1-F(x))^{n-i}$$

The corresponding density is:

$$f_{X_{(k)}}(x)= \frac{n!}{(n-k)!(k-1)!} (1-F(x))^{k-1} F(x)^{n-k} f(x)$$

Thus, $ f_{z_{n,p}}(x) \approx f_{X_{(\lfloor np\rfloor)}}(x) $, meaning that the sample quantile is approximately distributed as the corresponding order statistic.

The approximation $k \approx np$ may create difficulties when $np$ is not an integer and must be rounded. This is especially noticeable when the fractional part equals 0.5. Even then, however, the approximation is usable in practice.

A further complication is numerical: both the numerator and denominator contain factorials that grow extremely quickly. For example, when (n=15), the numerator is already about $1.3\times10^{12}$.

Which Approximation Should We Use?

We now know that:

For sufficiently large samples, sample quantiles become approximately normal.
They can also be approximated using the distribution of the corresponding order statistic.

How different are these approaches in practice?

To investigate, I generated sample quantiles from uniform, exponential, and normal distributions and compared the empirical distributions with:

the normal approximation,
the order-statistic approximation.

I measured the distance between distributions using the Wasserstein metric. Smaller values indicate a better match.

Uniform Distribution

The normal approximation is not valid for quantiles at levels 0 and 1.

Intuitively, a normal distribution centered at the edge of a uniform distribution cannot be correct because observations near the boundary only come from one side, creating asymmetry. The theorem also does not apply because the density is not differentiable at the boundaries.

The order statistic always exists, however. For a $U(0,1)$ distribution, the order statistic follows a Beta distribution: $ B(k,n-k+1) $.

Therefore, the sample quantile approximately follows $ B(np,n-np+1) $.

As $n$ increases and $p\neq0,1$:

$$z_{n,p} \sim N\left( p,\frac{p(1-p)}{n} \right)$$

The same idea extends to $U(a,b)$ after centering and scaling.

Results

For the 10th percentile $(p=0.1)$:

For sample sizes below 100, the order-statistic approximation is substantially more accurate.
For (n>100), both approximations become quite similar.

For the median $(p=0.5)$:

The normal approximation converges much faster.
Around $n>30$, it performs almost as well as the order-statistic approach.

For the 90th percentile and extreme quantiles such as 0.01 or 0.99:

The normal approximation again becomes reliable only when $n>100$.

Exponential Distribution

The exponential distribution satisfies the quantile theorem everywhere except at 0.

The order statistic does not simplify into a special named distribution, so we must use the general formula.

The results are very similar to those of the uniform distribution:

Quantiles far from the median require roughly $n>100$ before the normal approximation works well.
The median becomes approximately normal much sooner, around $n>30$.

Normal Distribution

For the normal distribution, the theorem applies everywhere.

The convergence pattern is essentially the same as for the uniform and exponential cases:

Medians converge quickly.
Tail quantiles require larger samples.
Around $n>100$, the normal approximation is nearly as accurate as the order-statistic approximation for all quantiles.

Python Code

To compute the results, for each sample size $n$ from N_LIST, I generated a sample of size $n$ from a uniform, exponential, or normal distribution. I then computed the sample quantile for each quantile level $p$ from P_LIST, stored the result, and repeated this process SAMPLE_SIZE times.

As a result, for every quantile level and every sample size, I obtained SAMPLE_SIZE realizations of the sample quantile.

Next, I generated samples of the same size (SAMPLE_SIZE) from the corresponding normal approximation and from the distribution of the relevant order statistic. I then calculated the distance between each of these distributions and the empirically observed distribution of sample quantiles.

I repeated these distance calculations AVG_NUM times, averaged the results, and stored the final value.

The complete Python code can be found below.

import math

import numpy as np
from scipy.stats import norm, wasserstein_distance


N_LIST = [10, 20, 30, 50, 100, 500]
P_LIST = [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]

SAMPLE_SIZE = 1000
AVG_NUM = 30

MU, SIGMA = 1, 2


def norm_order_pdf(x, mu, sigma, n, k):
    tmp_factorial = 1
    for i in range(n, n-k, -1):
        tmp_factorial *= i

    return_val =  tmp_factorial /\
    math.factorial(k-1) \
    * norm.cdf(x, loc=mu, scale=sigma)**(k-1) \
    * (1 - norm.cdf(x, loc=mu, scale=sigma))**(n-k) \
    * norm.pdf(x, loc=mu, scale=sigma)

    return return_val


def calc_distances():
    norm_var = norm(MU, SIGMA)
    
    norm_approx_res = {}
    order_approx_res = {}
    
    for p in P_LIST:
        norm_approx_res[p] = {}
        order_approx_res[p] = {}
        for n in N_LIST:
            norm_approx_distances = []
            order_approx_distances = []
            for _ in range(AVG_NUM):
                q_sample = []
                for i in range(SAMPLE_SIZE):
                    norm_sample = norm_var.rvs(size=n)
                    sample_q = np.quantile(norm_sample, p)
                    q_sample.append(sample_q)
                x0 = norm_var.pdf(norm_var.ppf(p))
                norm_approx_distances.append(
                    wasserstein_distance(
                        q_sample, 
                        norm(loc=norm_var.ppf(p), scale=(p * (1-p) / (n * (x0**2))) ** 0.5).rvs(SAMPLE_SIZE)
                    )
                )
                k = round(n * p) + int(p < 0.5)
                x = np.linspace(np.min(q_sample), np.max(q_sample), 50)
                theor_pdf = norm_order_pdf(x, MU, SIGMA, n, k)
                order_sample = np.random.choice(x, p=theor_pdf/theor_pdf.sum(), size=SIGMA)
                order_approx_distances.append(wasserstein_distance(q_sample, order_sample))
            norm_approx_res[p][n] = np.mean(norm_approx_distances)
            order_approx_res[p][n] = np.mean(order_approx_distances)
    return norm_approx_res, order_approx_res


calc_distances()

Conclusions

The distribution of sample quantiles can be approximated either by the distribution of the corresponding order statistic or by a normal distribution.
The order-statistic approach is generally more accurate for small samples, but it is also more computationally demanding.
For sample sizes below about 100, the order-statistic approximation is usually superior.
If the quantile is near the median and the theorem's assumptions hold, the normal approximation performs almost as well even for smaller samples.
For sample sizes above about 100, the normal approximation is nearly as accurate as the order-statistic approximation for quantiles of all levels.
The asymptotic theorem fails at points where the density is not differentiable:
- at the boundaries of the uniform distribution,
- at 0 for the exponential distribution,
- at similar singular points in other distributions (e.g., 0 for the Laplace distribution).

The practical takeaway is that confidence intervals can be constructed not only for means but also for quantiles, including the median, with a good understanding of the underlying approximation.

What Is the Distribution of Sample Quantiles?

Theorem on the Distribution of Sample Quantiles

Distribution via Order Statistics

Which Approximation Should We Use?

Uniform Distribution

Results

Exponential Distribution

Normal Distribution

Python Code

Conclusions

Comments

More from this blog

Automatically Extracting Piecewise-Linear Trends from a Time Series

CADE — An Interesting Approach to Finding Anomalies in Multidimensional Data

How and Why to Compute a Balanced Precision Score

Command Palette

Theorem on the Distribution of Sample Quantiles

Distribution via Order Statistics

Which Approximation Should We Use?

Uniform Distribution

Results

Exponential Distribution

Normal Distribution

Python Code

Conclusions

Comments

More from this blog