3.2.2 Stage 2: Uniform Convergence
The uniform convergence property says that empirical risk converges to true risk simultaneously for all hypotheses in \(\mathcal{H}\), not just for a single fixed \(h\).
Lean: HasUniformConvergence
A hypothesis class \(\mathcal{H}\) has the uniform convergence property if for every \(\varepsilon , \delta {\gt} 0\), there exists \(m_{\mathrm{UC}}(\varepsilon , \delta )\) such that for all distributions \(D\), whenever \(m \geq m_{\mathrm{UC}}(\varepsilon , \delta )\):
\[ \Pr _{S \sim D^m}\bigl[\exists h \in \mathcal{H} : |\hat{R}_S(h) - R_D(h)| {\gt} \varepsilon \bigr] \leq \delta . \]
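In the Lean development, the definition might be stated along the following lines. This is only a sketch: `Distribution`, `empRisk`, `trueRisk`, and the sampling notation are hypothetical placeholders, not the actual formalization.

```lean
-- Sketch only: identifiers and measure-theoretic plumbing are hypothetical.
def HasUniformConvergence (H : Set (X → Bool)) : Prop :=
  ∀ ε δ : ℝ, 0 < ε → 0 < δ →
    ∃ mUC : ℕ, ∀ (D : Distribution (X × Bool)), ∀ m ≥ mUC,
      (D.sample m).prob {S | ∀ h ∈ H, |empRisk S h - trueRisk D h| ≤ ε}
        ≥ 1 - δ
```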
If \(\Pi _\mathcal {H}(m) \leq (em/d)^d\) for all \(m \geq d\), then \(\mathcal{H}\) has the uniform convergence property with
\[ m_{\mathrm{UC}}(\varepsilon , \delta ) = O\! \left(\frac{d\log (1/\varepsilon ) + \log (1/\delta )}{\varepsilon ^2}\right). \]
In the realizable setting, the \(\varepsilon ^2\) denominator improves to \(\varepsilon \), giving \(m = O\! \left(\frac{d\log (1/\varepsilon ) + \log (1/\delta )}{\varepsilon }\right)\).
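As a quick numerical comparison of the two rates, the bounds can be evaluated directly. The leading constants below are illustrative placeholders (the \(O(\cdot)\) notation hides them), not values derived in the proof:

```python
import math

def m_uc_agnostic(d, eps, delta, c=64.0):
    # O((d log(1/eps) + log(1/delta)) / eps^2); the constant c is a placeholder.
    return math.ceil(c * (d * math.log(1 / eps) + math.log(1 / delta)) / eps**2)

def m_uc_realizable(d, eps, delta, c=8.0):
    # Realizable rate: 1/eps replaces 1/eps^2 in the denominator.
    return math.ceil(c * (d * math.log(1 / eps) + math.log(1 / delta)) / eps)

# Halving eps roughly quadruples the agnostic bound but only about
# doubles the realizable one (up to the log(1/eps) factor).
print(m_uc_agnostic(10, 0.1, 0.05), m_uc_realizable(10, 0.1, 0.05))
```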
The proof uses the symmetrization (or “ghost sample”) technique of Vapnik and Chervonenkis.
Step 1: Symmetrization. Draw two independent samples \(S, S' \sim D^m\). We claim that for \(m \geq 8/\varepsilon ^2\),
\[ \Pr \bigl[\exists h \in \mathcal{H} : |R_D(h) - \hat{R}_S(h)| {\gt} \varepsilon \bigr] \leq 2\, \Pr \bigl[\exists h \in \mathcal{H} : |\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\bigr]. \]
This follows because if \(|R_D(h) - \hat{R}_S(h)| {\gt} \varepsilon \) for some \(h\), then with probability at least \(1/2\) (by a Chebyshev argument on \(S'\)), the ghost sample satisfies \(|\hat{R}_{S'}(h) - R_D(h)| \leq \varepsilon /2\), and hence \(|\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\) by the triangle inequality.
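The Chebyshev step is elementary arithmetic: for a fixed \(h\), the empirical risk on \(S'\) is an average of \(m\) Bernoulli losses, so its variance is at most \(1/(4m)\), and for \(m \geq 8/\varepsilon ^2\) the deviation probability is at most \(1/8 \leq 1/2\). A minimal check of that arithmetic:

```python
import math

def chebyshev_tail(m, eps):
    # Pr[|R_hat_{S'}(h) - R_D(h)| > eps/2] <= Var / (eps/2)^2,
    # with Var(R_hat_{S'}(h)) = p(1-p)/m <= 1/(4m) for 0-1 loss.
    return (1 / (4 * m)) / (eps / 2) ** 2  # = 1 / (m * eps^2)

for eps in (0.5, 0.2, 0.05):
    m = math.ceil(8 / eps**2)              # the threshold from the claim
    assert chebyshev_tail(m, eps) <= 1 / 2  # ghost sample eps/2-close w.p. >= 1/2
```

In fact the bound evaluates to \(1/(m\varepsilon ^2) \leq 1/8\), comfortably below the \(1/2\) the argument needs.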
Step 2: Permutation argument. Let \(T = S \cup S'\) be the pooled sample of size \(2m\). Conditioned on \(T\), the partition into \(S\) and \(S'\) is uniformly random among all \(\binom {2m}{m}\) splits. By a union bound over the distinct label patterns that \(\mathcal{H}\) induces on \(T\):
\[ \Pr \bigl[\exists h \in \mathcal{H} : |\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\bigr] \leq \Pi _\mathcal {H}(2m) \cdot \max _{h} \Pr \bigl[|\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\bigr], \]
where the probabilities are over the random split of \(T\).
The number of distinct behaviors is at most \(\Pi _\mathcal {H}(2m)\), and for each fixed labeling pattern, \(\hat{R}_S(h) - \hat{R}_{S'}(h)\) is a sum of \(2m\) centered random variables (each \(\pm 1/m\) depending on which half the point falls in).
Step 3: Hoeffding bound. For each fixed label pattern, Hoeffding’s inequality gives
\[ \Pr \bigl[|\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\bigr] \leq 2\exp (-m\varepsilon ^2/8). \]
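The permutation step can be simulated in the standard paired-swap form (swap each pair \((z_i, z_i')\) between the halves independently with probability \(1/2\)): fix a hypothetical error pattern for one \(h\) on the pooled sample and compare the empirical tail of \(|\hat{R}_S(h) - \hat{R}_{S'}(h)|\) against the bound \(2e^{-m\varepsilon ^2/8}\).

```python
import math, random

random.seed(0)
m, eps = 200, 0.2
# Hypothetical error indicators of one fixed h on the two halves of T.
errs_S  = [1] * 60 + [0] * 140   # h errs on the first 60 points of S
errs_Sp = [0] * 140 + [1] * 60   # and on the last 60 points of S'

trials, exceed = 20000, 0
for _ in range(trials):
    diff = 0
    for a, b in zip(errs_S, errs_Sp):
        if random.random() < 0.5:    # swap the paired points between halves
            a, b = b, a
        diff += a - b                # contributes +-1/m to R_hat_S - R_hat_S'
    if abs(diff) / m > eps / 2:
        exceed += 1

empirical = exceed / trials
hoeffding = 2 * math.exp(-m * eps**2 / 8)
print(empirical, "<=", hoeffding)
```

The empirical tail sits well below the Hoeffding bound, which is loose but dimension-free for each fixed pattern.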
Combining with the Sauer–Shelah bound:
\[ \Pr \bigl[\exists h \in \mathcal{H} : |R_D(h) - \hat{R}_S(h)| {\gt} \varepsilon \bigr] \leq 4\, \Pi _\mathcal {H}(2m) \exp (-m\varepsilon ^2/8) \leq 4\left(\frac{2em}{d}\right)^{\! d} \exp (-m\varepsilon ^2/8). \]
Setting this \(\leq \delta \) and solving for \(m\) gives
\[ m = O\! \left(\frac{d\log (1/\varepsilon ) + \log (1/\delta )}{\varepsilon ^2}\right). \]
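The solve can also be done numerically. The helper below (a sketch; doubling followed by bisection is just one way to invert the bound) finds the smallest \(m\) with \(4(2em/d)^d e^{-m\varepsilon ^2/8} \leq \delta \), working in log space to avoid overflow:

```python
import math

def smallest_m(d, eps, delta):
    # log of 4 * (2em/d)^d * exp(-m eps^2 / 8), evaluated in log space.
    def log_bound(m):
        return math.log(4) + d * math.log(2 * math.e * m / d) - m * eps**2 / 8
    hi = d
    while log_bound(hi) > math.log(delta):   # doubling phase
        hi *= 2
    lo = hi // 2                              # last value that failed
    while lo + 1 < hi:                        # bisection phase
        mid = (lo + hi) // 2
        if log_bound(mid) <= math.log(delta):
            hi = mid
        else:
            lo = mid
    return hi

print(smallest_m(10, 0.1, 0.05))   # grows roughly like d/eps^2 up to log factors
```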
Realizable improvement. In the realizable setting, \(R_D(c^*) = 0\), so only one-sided deviations matter: we need \(\Pr [\exists h \in \mathcal{H} : \hat{R}_S(h) = 0 \text{ but } R_D(h) {\gt} \varepsilon ] \leq \delta \). For any fixed \(h\) with \(R_D(h) {\gt} \varepsilon \), the probability that \(h\) is consistent with all \(m\) samples is at most \((1 - \varepsilon )^m \leq e^{-m\varepsilon }\). A union bound over the \(\Pi _\mathcal {H}(m)\) distinct behaviors gives
\[ \Pr \bigl[\exists h \in \mathcal{H} : \hat{R}_S(h) = 0 \text{ and } R_D(h) {\gt} \varepsilon \bigr] \leq \Pi _\mathcal {H}(m)\, e^{-m\varepsilon } \leq \left(\frac{em}{d}\right)^{\! d} e^{-m\varepsilon }. \]
Setting this \(\leq \delta \) yields \(m = O\! \left(\frac{d\log (d/\varepsilon ) + \log (1/\delta )}{\varepsilon }\right)\), confirming the \(1/\varepsilon \) dependence.
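Two small checks of the realizable derivation, assuming nothing beyond the inequalities in the text: the elementary bound \((1-\varepsilon )^m \leq e^{-m\varepsilon }\), and that plugging \(m = c\,(d\log (d/\varepsilon ) + \log (1/\delta ))/\varepsilon \) into \((em/d)^d e^{-m\varepsilon }\) drops it below \(\delta \) (the constant \(c = 2\) is illustrative, not from the proof):

```python
import math

# (1 - eps)^m <= exp(-m * eps), from 1 - x <= e^{-x}.
for eps in (0.01, 0.1, 0.5):
    for m in (1, 10, 1000):
        assert (1 - eps) ** m <= math.exp(-m * eps)

# Evaluate the union bound at the claimed sample size, in log space.
d, eps, delta, c = 10, 0.1, 0.05, 2.0
m = math.ceil(c * (d * math.log(d / eps) + math.log(1 / delta)) / eps)
log_bound = d * math.log(math.e * m / d) - m * eps   # log of (em/d)^d e^{-m eps}
assert log_bound <= math.log(delta)
print(m, log_bound, math.log(delta))
```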
The optimal sample complexity in the realizable case is \(\Theta \! \left(\frac{d + \log (1/\delta )}{\varepsilon }\right)\), achieved by a careful one-inclusion graph analysis. The \(\log (d/\varepsilon )\) factor in the union bound above is a mild overhead; Hanneke showed it can be removed entirely, establishing the tight bound.