Formal Learning Theory Kernel: Blueprint v1

3.2.4 Stage 4: Infinite VC Dimension Implies Failure

Theorem 3.15 Converse of the VC Characterization

If \(\mathrm{VCdim}(\mathcal{H}) = \infty \), then \(\mathcal{H}\) is not PAC learnable.

Proof

The proof constructs, for any candidate learner \(A\) and any sample size \(m\), a distribution on which \(A\) must fail. The construction is adversarial.

Since \(\mathrm{VCdim}(\mathcal{H}) = \infty \), there exists a shattered set \(C = \{ x_1, \ldots , x_{2m}\} \subseteq X\) of size \(2m\). By the definition of shattering, for every labeling \(b \in \{ 0,1\} ^{2m}\), there exists \(h_b \in \mathcal{H}\) with \(h_b(x_i) = b_i\) for all \(i\).
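The shattering step can be made concrete. A minimal Python sketch, using the class of all \(\{0,1\}\)-valued functions on the integers (which has infinite VC dimension) as a hypothetical stand-in for \(\mathcal{H}\): every labeling \(b\) of the finite set \(C\) is realized by a witness \(h_b\).

```python
from itertools import product

m = 3
C = list(range(2 * m))  # shattered set of size 2m

def h(b):
    """Witness hypothesis h_b: maps x_i to b_i on C (arbitrary off C)."""
    table = dict(zip(C, b))
    return lambda x: table.get(x, 0)

# Check shattering: every labeling of C is realized by some h_b.
realized = set()
for b in product([0, 1], repeat=2 * m):
    h_b = h(b)
    realized.add(tuple(h_b(x) for x in C))

assert len(realized) == 2 ** (2 * m)  # all 2^{2m} labelings appear
```

The stand-in class is the extreme case; the proof only needs that some shattered set of size \(2m\) exists inside \(\mathcal{H}\).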

Now consider the uniform distribution \(D_b\) on \(C\) with target concept \(h_b\). Run the learner \(A\) on a sample \(S\) of size \(m\) drawn uniformly from \(C\). Since the sample contains at most \(m\) distinct points, at least \(m\) of the \(2m\) points of \(C\) are unseen — deterministically, not merely with high probability. This is exactly why the construction uses a shattered set of size \(2m\) rather than \(m\).
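The counting step can be checked directly: \(m\) i.i.d. draws from a set of \(2m\) points hit at most \(m\) distinct points, so at least \(m\) points are always unseen. A standalone Python sketch:

```python
import random

random.seed(0)
m = 50
C = range(2 * m)

# A sample of m draws contains at most m distinct points, so at least
# m of the 2m points of C are unseen on every run.
unseen_counts = []
for _ in range(1000):
    S = {random.choice(C) for _ in range(m)}  # m i.i.d. uniform draws
    unseen_counts.append(2 * m - len(S))

min_unseen = min(unseen_counts)
assert min_unseen >= m
```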

On the unseen points, the learner has no information about the target labeling. We apply a probabilistic argument over \(b\) drawn uniformly from \(\{ 0,1\} ^{2m}\):

Key claim. Fix the algorithm \(A\) and the points of the training set \(S\). If \(b\) is drawn uniformly at random, then whatever hypothesis \(A(S)\) outputs, its expected risk under \(D_b\) is at least \(\frac{1}{2} \cdot \frac{m}{2m} = \frac{1}{4}\): each unseen point contributes error \(1/2\) on average, and the unseen points carry at least half the mass of \(D_b\).

This follows because \(C\) is shattered: every completion of the observed labels to all of \(C\) is realized by some \(h_b \in \mathcal{H}\), so conditioned on \(S\), the labels \(b_j\) for unseen points \(x_j \notin S\) remain independent uniform bits under the uniform distribution on target functions. Therefore \(A(S)\)’s prediction on each unseen point is correct with probability exactly \(1/2\), regardless of the learner’s strategy.
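This averaging step can be tested numerically. In the Python sketch below, `seen` and the majority-vote `learner` are hypothetical stand-ins chosen for illustration; any fixed strategy gives the same result, since the unseen labels are fresh uniform bits.

```python
import random

random.seed(1)
m, trials = 5, 20000
C = list(range(2 * m))
seen = set(range(m))                # fix which points the sample covered
unseen = [x for x in C if x not in seen]

def learner(labeled):
    """Any fixed strategy; here: predict the majority seen label."""
    guess = int(sum(labeled.values()) * 2 > len(labeled))
    return lambda x: labeled.get(x, guess)

err = 0.0
for _ in range(trials):
    b = [random.randint(0, 1) for _ in C]   # uniform target labeling
    h = learner({x: b[x] for x in seen})
    err += sum(h(x) != b[x] for x in unseen) / len(unseen)
err /= trials

# Error on unseen points concentrates near 1/2, whatever the strategy.
assert abs(err - 0.5) < 0.02
```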

Since the unseen points constitute at least half the support of \(D_b\), we have \(\mathbb {E}_b[R_{D_b}(A(S))] \geq 1/4\) for every realization of the sample points, and hence \(\mathbb {E}_b\,\mathbb {E}_S[R_{D_b}(A(S))] \geq 1/4\) after averaging over \(S\). Exchanging the expectations, there exists a specific \(b^*\) with \(\mathbb {E}_S[R_{D_{b^*}}(A(S))] \geq 1/4\). Since the risk is bounded by \(1\), Markov’s inequality gives \(\Pr_S[R_{D_{b^*}}(A(S)) \geq 1/8] \geq 1/7\), so \(A\) cannot achieve error \(\varepsilon {\lt} 1/8\) with confidence \(\delta {\lt} 1/7\) using \(m\) samples under \(D_{b^*}\). Since \(m\) was arbitrary, \(\mathcal{H}\) is not PAC learnable.
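The full bound \(\mathbb {E}_b[R_{D_b}(A(S))] \geq 1/4\) can also be observed in simulation. The sketch below uses a memorizing learner (perfect on seen points, a fixed guess elsewhere) as a best case for the learner; its average risk under the uniform \(D_b\) still stays above \(1/4\).

```python
import random

random.seed(2)
m, trials = 20, 5000
C = list(range(2 * m))

avg_risk = 0.0
for _ in range(trials):
    b = [random.randint(0, 1) for _ in C]     # uniform random target
    S = [random.choice(C) for _ in range(m)]  # m uniform draws from C
    labeled = {x: b[x] for x in S}
    # Memorizing learner: correct on every seen point, guesses 0 elsewhere.
    risk = sum(labeled.get(x, 0) != b[x] for x in C) / len(C)
    avg_risk += risk
avg_risk /= trials

assert avg_risk >= 0.25   # matches E_b[R_{D_b}] >= 1/4 from the proof
```

The observed average sits near \(e^{-1/2}/2 \approx 0.30\), above the \(1/4\) the proof guarantees, because some sampled points repeat.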

Remark 3.16 Strength of the converse

The converse is stronger than “not uniformly convergent”: it says no learner whatsoever, not just ERM, can PAC learn \(\mathcal{H}\). The argument works because the adversary chooses the distribution after seeing the learner. This distribution-free adversarial structure is the engine that makes VC dimension necessary, not just sufficient.

The circle closes. Finite VC dimension forces polynomial growth (Sauer–Shelah), which forces uniform convergence, which makes ERM succeed, which gives PAC learnability. Infinite VC dimension shatters arbitrarily large sets, and the converse kills learnability outright. A single combinatorial quantity, the largest set the class can shatter, controls everything.