Formal Learning Theory Kernel: Blueprint v1

3.2.2 Stage 2: Uniform Convergence

The uniform convergence property says that empirical risk converges to true risk simultaneously for all hypotheses in \(\mathcal{H}\), not just for a single fixed \(h\).

Definition 3.11 Uniform Convergence

Lean: HasUniformConvergence

A hypothesis class \(\mathcal{H}\) has the uniform convergence property if for every \(\varepsilon , \delta {\gt} 0\), there exists \(m_{\mathrm{UC}}(\varepsilon , \delta )\) such that for all distributions \(D\), whenever \(m \geq m_{\mathrm{UC}}(\varepsilon , \delta )\):

\[ \Pr _{S \sim D^m}\! \left[ \sup _{h \in \mathcal{H}} |R_D(h) - \hat{R}_S(h)| {\gt} \varepsilon \right] \leq \delta . \]
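To see the definition in action, here is a small numerical sketch (illustration only, not part of the formal development): for threshold classifiers \(h_t(x) = \mathbb{1}[x \geq t]\) on \([0,1]\) (VC dimension 1), with uniform inputs, true threshold \(1/2\), and label noise \(\eta\), the supremum deviation \(\sup_t |R_D(h_t) - \hat{R}_S(h_t)|\) shrinks as \(m\) grows. All names below are ad hoc for this sketch.

```python
import random

def sup_deviation(m, eta=0.1, seed=0):
    """Sup over thresholds t of |R_D(h_t) - R_hat_S(h_t)| for h_t(x) = 1[x >= t].
    Data: x ~ Uniform[0,1], clean label 1[x >= 0.5], flipped with prob eta.
    The true risk has the closed form R_D(h_t) = eta + (1 - 2*eta)*|t - 0.5|."""
    rng = random.Random(seed)
    pts = []
    for _ in range(m):
        x = rng.random()
        y = 1 if x >= 0.5 else 0
        if rng.random() < eta:
            y = 1 - y          # label noise
        pts.append((x, y))

    def emp_risk(t):
        return sum(1 for x, y in pts if (1 if x >= t else 0) != y) / m

    def true_risk(t):
        return eta + (1 - 2 * eta) * abs(t - 0.5)

    # The empirical risk is piecewise constant in t with breakpoints at the
    # sample points, and the true risk is piecewise linear with a kink at 0.5,
    # so checking each x_i, a point just above each x_i, the endpoints, and
    # 0.5 locates the supremum (up to 1e-9 slack).
    cands = [0.0, 0.5, 1.0]
    for x, _ in pts:
        cands += [x, x + 1e-9]
    return max(abs(true_risk(t) - emp_risk(t)) for t in cands)

print(sup_deviation(50))    # loose fit at small m
print(sup_deviation(5000))  # uniformly close at large m
```

Because the empirical risk is piecewise constant in \(t\), the supremum is computed from finitely many candidate thresholds rather than a grid.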
Theorem 3.12 Uniform Convergence from Finite Growth

If \(\Pi _\mathcal {H}(m) \leq (em/d)^d\) for all \(m \geq d\), then \(\mathcal{H}\) has the uniform convergence property with

\[ m_{\mathrm{UC}}(\varepsilon , \delta ) = O\! \left(\frac{d \log (1/\varepsilon ) + \log (1/\delta )}{\varepsilon ^2}\right). \]

In the realizable setting, the \(\varepsilon ^2\) denominator improves to \(\varepsilon \), giving \(m = O\! \left(\frac{d\log (1/\varepsilon ) + \log (1/\delta )}{\varepsilon }\right)\).

Proof

The proof uses the symmetrization (or “ghost sample”) technique of Vapnik and Chervonenkis.

Step 1: Symmetrization. Draw two independent samples \(S, S' \sim D^m\). We claim that for \(m \geq 8/\varepsilon ^2\),

\[ \Pr _S\! \left[\sup _{h} |R_D(h) - \hat{R}_S(h)| {\gt} \varepsilon \right] \leq 2\, \Pr _{S,S'}\! \left[ \sup _h |\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2 \right]. \]

This follows because if \(|R_D(h) - \hat{R}_S(h)| {\gt} \varepsilon \) for some \(h\), then with probability at least \(1/2\) (by a Chebyshev argument on \(S'\)), the ghost sample satisfies \(|\hat{R}_{S'}(h) - R_D(h)| \leq \varepsilon /2\), and hence \(|\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\) by the triangle inequality.
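The Chebyshev step can be made explicit. For fixed \(h\), \(\hat{R}_{S'}(h)\) is an average of \(m\) independent Bernoulli\((R_D(h))\) indicators, so \(\operatorname{Var}(\hat{R}_{S'}(h)) \leq 1/(4m)\), and Chebyshev's inequality gives, for \(m \geq 8/\varepsilon ^2\),

\[ \Pr _{S'}\! \bigl[ |\hat{R}_{S'}(h) - R_D(h)| {\gt} \varepsilon /2 \bigr] \leq \frac{1/(4m)}{(\varepsilon /2)^2} = \frac{1}{m\varepsilon ^2} \leq \frac{1}{8}, \]

so the ghost sample is \(\varepsilon /2\)-close to \(R_D(h)\) with probability at least \(7/8 \geq 1/2\).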

Step 2: Permutation argument. Let \(T = S \cup S'\) be the pooled sample of size \(2m\). Conditioned on \(T\), the partition into \(S\) and \(S'\) is uniformly random among all \(\binom {2m}{m}\) splits. By a union bound over the distinct label patterns that \(\mathcal{H}\) induces on \(T\):

\[ \Pr _{S,S'}\! \left[\sup _h |\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\right] \leq \Pi _\mathcal {H}(2m) \cdot \max _h \Pr _{\text{split}}\! \bigl[|\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\bigr]. \]

The number of distinct behaviors is at most \(\Pi _\mathcal {H}(2m)\), and for each fixed labeling pattern, \(\hat{R}_S(h) - \hat{R}_{S'}(h)\) is a sum of \(2m\) centered random variables (each \(\pm 1/m\) depending on which half the point falls in).

Step 3: Hoeffding bound. For each fixed label pattern, Hoeffding’s inequality gives

\[ \Pr _{\text{split}}\! \bigl[|\hat{R}_S(h) - \hat{R}_{S'}(h)| {\gt} \varepsilon /2\bigr] \leq 2\exp (-m\varepsilon ^2/8). \]
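This inequality can be checked empirically (a quick Monte Carlo sketch, not a proof): fix one error pattern on the pooled sample of size \(2m\), split it uniformly at random into halves, and compare the frequency of large deviations against \(2\exp(-m\varepsilon^2/8)\). The names below are ad hoc for this sketch.

```python
import math
import random

def split_tail_prob(errs, eps, trials=20000, seed=0):
    """Estimate Pr over uniform equal splits of the 2m pooled points that
    |R_hat_S(h) - R_hat_S'(h)| > eps/2, for one fixed error pattern `errs`
    (errs[i] = 1 iff h misclassifies pooled point i)."""
    rng = random.Random(seed)
    n = len(errs)           # n = 2m
    m = n // 2
    idx = list(range(n))
    hits = 0
    for _ in range(trials):
        rng.shuffle(idx)    # a uniformly random split into S and S'
        s_err = sum(errs[i] for i in idx[:m]) / m
        sp_err = sum(errs[i] for i in idx[m:]) / m
        if abs(s_err - sp_err) > eps / 2:
            hits += 1
    return hits / trials

m, eps = 100, 0.4
errs = [1] * m + [0] * m    # a pattern with half the pooled points in error
est = split_tail_prob(errs, eps)
bound = 2 * math.exp(-m * eps ** 2 / 8)
print(est, "<=", bound)
```

The estimated tail probability sits well below the bound, as expected: the Hoeffding bound for random splits (sampling without replacement) is far from tight at these parameters.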

Combining with the Sauer–Shelah bound:

\[ \Pr _S\! \left[\sup _h |R_D(h) - \hat{R}_S(h)| {\gt} \varepsilon \right] \leq 4 \left(\frac{2em}{d}\right)^d \exp \! \left(-\frac{m\varepsilon ^2}{8}\right). \]

Setting this \(\leq \delta \) and solving for \(m\) gives

\[ m = O\! \left(\frac{d\log (1/\varepsilon ) + \log (1/\delta )}{\varepsilon ^2}\right). \]
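As a numerical sanity check (illustration only, with ad hoc names), one can invert the displayed bound directly: the smallest \(m\) with \(4(2em/d)^d e^{-m\varepsilon^2/8} \leq \delta\) grows roughly by a factor of four when \(\varepsilon\) is halved, reflecting the \(1/\varepsilon^2\) dependence.

```python
import math

def m_uc(d, eps, delta):
    """Smallest m >= d with 4*(2*e*m/d)**d * exp(-m*eps**2/8) <= delta,
    found by doubling followed by bisection, working in log space."""
    def bad(m):
        return (math.log(4) + d * math.log(2 * math.e * m / d)
                - m * eps ** 2 / 8) > math.log(delta)
    lo = hi = d   # the growth-function bound is only claimed for m >= d
    while bad(hi):
        hi *= 2
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if bad(mid):
            lo = mid
        else:
            hi = mid
    return hi

print(m_uc(10, 0.1, 0.05))
print(m_uc(10, 0.05, 0.05))  # roughly 4x larger: the 1/eps^2 scaling
```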

Realizable improvement. In the realizable setting, \(R_D(c^*) = 0\), so only one-sided deviations matter: we need \(\Pr [\exists h \in \mathcal{H} : \hat{R}_S(h) = 0 \text{ but } R_D(h) {\gt} \varepsilon ] \leq \delta \). For any fixed \(h\) with \(R_D(h) {\gt} \varepsilon \), the probability that \(h\) is consistent with all \(m\) samples is at most \((1 - \varepsilon )^m \leq e^{-m\varepsilon }\). A union bound over the \(\Pi _\mathcal {H}(m)\) distinct behaviors gives

\[ \Pr [\text{bad}] \leq \left(\frac{em}{d}\right)^d e^{-m\varepsilon }. \]

Setting this \(\leq \delta \) yields \(m = O\! \left(\frac{d\log (d/\varepsilon ) + \log (1/\delta )}{\varepsilon }\right)\), confirming the \(1/\varepsilon \) dependence.
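The realizable bound admits the same kind of numerical check (illustration only, ad hoc names): inverting \((em/d)^d e^{-m\varepsilon} \leq \delta\) shows \(m\) growing roughly linearly in \(1/\varepsilon\) (times the \(\log(d/\varepsilon)\) overhead), rather than quadratically.

```python
import math

def m_realizable(d, eps, delta):
    """Smallest m >= d with (e*m/d)**d * exp(-m*eps) <= delta,
    found by doubling followed by bisection, working in log space."""
    def bad(m):
        return d * math.log(math.e * m / d) - m * eps > math.log(delta)
    lo = hi = d
    while bad(hi):
        hi *= 2
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if bad(mid):
            lo = mid
        else:
            hi = mid
    return hi

print(m_realizable(10, 0.1, 0.05))
print(m_realizable(10, 0.01, 0.05))  # larger by ~1/eps, times a log factor
```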

Remark 3.13 The Hanneke refinement

The optimal sample complexity in the realizable case is \(\Theta \! \left(\frac{d + \log (1/\delta )}{\varepsilon }\right)\), achieved by a careful one-inclusion graph analysis. The \(\log (d/\varepsilon )\) factor in the union bound above is a mild overhead; Hanneke showed it can be removed entirely, establishing the tight bound.