Empirical CDF Convergence
Let's revisit the observation from the first section that the CDF of the empirical distribution of an independent list of observations from a distribution tends to be close to the CDF of the distribution itself. The Glivenko-Cantelli theorem is a mathematical formulation of the idea that these two functions are indeed close.
Theorem (Glivenko-Cantelli)
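If $F$ is the CDF of a distribution $\nu$ and $\widehat{F}_n$ is the CDF of the empirical distribution of $n$ independent observations from $\nu$, then $\widehat{F}_n$ converges to $F$ along the whole number line:
$$\sup_{x \in \mathbb{R}} |F(x) - \widehat{F}_n(x)| \to 0 \quad \text{as } n \to \infty, \text{ in probability.}$$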
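To see the theorem in action, the following sketch computes the largest gap between the empirical CDF and the true CDF for samples of increasing size. It assumes the Exponential(1) distribution, chosen only because the code cell later in this section uses it. The helper name `sup_distance` is hypothetical, and `randexp` comes from Julia's `Random` standard library.

```julia
using Random

# Exact largest gap between the empirical CDF of `sample` and the CDF `F`.
# The empirical CDF only changes at the sorted observations, so it suffices
# to compare F at each one with the empirical CDF just before the jump,
# (i-1)/n, and just after it, i/n.
function sup_distance(sample, F)
    n = length(sample)
    xs = sort(sample)
    return maximum(max(i/n - F(xs[i]), F(xs[i]) - (i-1)/n) for i in 1:n)
end

F(x) = 1 - exp(-x)     # CDF of the Exponential(1) distribution (assumed example)
Random.seed!(1)        # arbitrary seed, only for reproducibility

for n in (10, 100, 1_000, 10_000)
    d = sup_distance(randexp(n), F)
    println("n = ", n, "   largest gap = ", round(d, digits = 4))
end
```

The printed gaps should tend to shrink as $n$ grows, although any single run is random and need not decrease at every step.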
The Dvoretzky-Kiefer-Wolfowitz inequality quantifies this result by providing a confidence band.
Theorem (Dvoretzky-Kiefer-Wolfowitz inequality)
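If $X_1, X_2, \ldots$ are independent random variables with common CDF $F$, and $\widehat{F}_n$ is the empirical CDF of $X_1, \ldots, X_n$, then for all $\epsilon > 0$, we have
$$\mathbb{P}\left(\sup_{x \in \mathbb{R}} |F(x) - \widehat{F}_n(x)| \geq \epsilon\right) \leq 2e^{-2n\epsilon^2}.$$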
In other words, the probability that the graph of $\widehat{F}_n$ lies in the $\epsilon$-band around $F$ (or vice versa) is at least $1 - 2e^{-2n\epsilon^2}$.
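The inequality can be checked by simulation. The sketch below estimates how often the largest gap between the empirical and true CDFs reaches $\epsilon$ and compares that frequency with the bound $2e^{-2n\epsilon^2}$. The distribution (Exponential(1)), the values of `n`, `ϵ`, and `trials`, and the helper names `sup_distance` and `dkw_bound` are all assumptions made for illustration.

```julia
using Random

# Exact largest gap between the empirical CDF of `sample` and the CDF `F`,
# checked just before and just after each jump of the empirical CDF.
function sup_distance(sample, F)
    n = length(sample)
    xs = sort(sample)
    return maximum(max(i/n - F(xs[i]), F(xs[i]) - (i-1)/n) for i in 1:n)
end

# Right-hand side of the DKW inequality.
dkw_bound(n, ϵ) = 2exp(-2n * ϵ^2)

F(x) = 1 - exp(-x)                 # CDF of Exponential(1) (assumed example)
n, ϵ, trials = 50, 0.15, 10_000    # arbitrary illustrative values
Random.seed!(2)                    # arbitrary seed, only for reproducibility

# Count the simulated samples whose largest gap is at least ϵ.
exceed = count(_ -> sup_distance(randexp(n), F) ≥ ϵ, 1:trials)

println("estimated P(largest gap ≥ ϵ) = ", exceed / trials)
println("DKW upper bound              = ", dkw_bound(n, ϵ))
```

Up to Monte Carlo error, the estimated frequency should not exceed the bound.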
Exercise
Use the code cell below to experiment with various values of $n$, and confirm that the graph of the empirical CDF usually falls inside the DKW confidence band.
Hint: after running the cell once, you can comment out the first plot command and run the cell repeatedly to get lots of empirical CDFs to show up on the same graph.
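    using Plots, Distributions, Roots
    n = 50
    # Half-width ϵ of the band: solve 2exp(-2nϵ²) = 0.1 (a 90% confidence band)
    ϵ = find_zero(ϵ -> 2exp(-2n*ϵ^2) - 0.1, 0.05)
    xs = range(0, 8, length=100)
    # Shaded band of half-width ϵ around the true CDF of Exponential(1)
    plot(xs, x-> 1-exp(-x) + ϵ, fillrange = x-> 1-exp(-x) - ϵ,
         label = "true CDF", legend = :bottomright, fillalpha = 0.2, linewidth = 0)
    # Empirical CDF of n independent Exponential(1) observations
    plot!(sort(rand(Exponential(1),n)), (1:n)/n,
          seriestype = :steppre, label = "empirical CDF")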
As $n$ increases, the confidence band gets narrower.
Exercise
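Show that if $\epsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}$, then with probability at least $1 - \alpha$, we have $|F(x) - \widehat{F}_n(x)| \leq \epsilon_n$ for all $x \in \mathbb{R}$.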
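Solution. If we substitute the given value of $\epsilon_n$ for $\epsilon$ in the DKW inequality, the right-hand side reduces to $2e^{-2n\epsilon_n^2} = 2e^{-\log(2/\alpha)} = \alpha$. So the probability that the gap $|F(x) - \widehat{F}_n(x)|$ reaches $\epsilon_n$ for some $x$ is at most $\alpha$, and the desired inequality holds for all $x$ with probability at least $1 - \alpha$.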
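The closed-form half-width $\epsilon_n = \sqrt{\log(2/\alpha)/(2n)}$ can also be checked numerically. The sketch below evaluates it for a few sample sizes and substitutes it back into the DKW bound. The helper name `dkw_halfwidth` and the values of `α` and `n` are assumptions made for illustration.

```julia
# Half-width of the DKW band at confidence level 1 - α.
dkw_halfwidth(n, α) = sqrt(log(2 / α) / (2n))

α = 0.1    # arbitrary illustrative level, giving a 90% band
for n in (50, 200, 800, 3200)
    ϵ = dkw_halfwidth(n, α)
    # Substituting ϵ back into the DKW bound should recover α,
    # up to floating-point rounding.
    println("n = ", n, "   ϵ = ", round(ϵ, digits = 4),
            "   2exp(-2nϵ^2) = ", 2exp(-2n * ϵ^2))
end
```

Because $\epsilon_n$ is proportional to $1/\sqrt{n}$, quadrupling the sample size halves the width of the band. With this formula, the root-finding call to `find_zero` in the code cell above could be replaced by a direct computation.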