3.5.2 The No-Free-Lunch Theorem (Full Proof)
The companion textbook’s Chapter 1 (The Objects of Learning) gave the statement and a brief argument. Here we give the full proof via an averaging argument over all target functions.
Let \(|X| \geq 2m\). For any learning algorithm \(A\),
\[
\max_{c \in \{0,1\}^X} \; \mathbb{E}_{S \sim D_c^m}\big[R_{D_c}(A(S))\big] \;\geq\; \frac{1}{4},
\]
where \(D_c\) is the uniform distribution on \(X\) with target \(c\).
Instead of proving the bound for the \(\max\) directly, we prove the stronger statement with \(\mathbb{E}_c\) (expectation over a uniformly random target), which implies that the \(\max\) is at least as large.
Fix any algorithm \(A\). Let \(S = (x_1, \ldots, x_m)\) be drawn uniformly from \(X\) (i.i.d.), and let \(c\) be drawn uniformly from \(\{0,1\}^X\). Write \(T = X \setminus \{x_1, \ldots, x_m\}\) for the unseen points.
Key observation. Conditioned on \(S\) and on the labels \((c(x_1), \ldots, c(x_m))\), the labels \(\{c(x) : x \in T\}\) are still independent uniform bits. This is because the prior on \(c\) is a product measure: each label \(c(x)\) is an independent fair coin, so conditioning on the labels of the sampled points leaves the remaining coordinates untouched.
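This conditional-uniformity fact can be checked by exhaustive enumeration on a toy domain. The following is a minimal sketch; the domain size, observed points, and observed labels below are arbitrary illustrative choices of ours, not part of the proof.

```python
import itertools

# Toy domain X = {0,...,5}; condition on the labels of three observed points.
X = list(range(6))
seen = [0, 1, 2]
unseen = [x for x in X if x not in seen]
observed_labels = (1, 0, 1)  # the conditioned values c(0), c(1), c(2)

# Enumerate all 2^|X| targets c: X -> {0,1}; keep those consistent with
# the observed labels (the conditional distribution is uniform over these).
consistent = [c for c in itertools.product([0, 1], repeat=len(X))
              if tuple(c[x] for x in seen) == observed_labels]

# Among consistent targets, each unseen label is 1 exactly half the time.
for x in unseen:
    ones = sum(c[x] for c in consistent)
    print(x, ones / len(consistent))  # -> 0.5 for every unseen point
```

Because the prior is a product measure, conditioning removes exactly the observed coordinates: \(2^{|X| - m}\) targets remain consistent, split evenly on every unseen coordinate.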
Therefore, for each \(x \in T\), regardless of what \(A(S)\) predicts:
\[
\Pr_{c}\big[\, A(S)(x) \neq c(x) \;\big|\; S,\; c(x_1), \ldots, c(x_m) \,\big] \;=\; \frac{1}{2}.
\]
The expected risk on unseen points is:
\[
\mathbb{E}_c\big[R_{D_c}(A(S))\big] \;\geq\; \frac{1}{|X|} \sum_{x \in T} \Pr_c\big[A(S)(x) \neq c(x)\big] \;=\; \frac{|T|}{2|X|} \;\geq\; \frac{1}{4},
\]
using \(|T| \geq |X|/2\) (since \(m \leq |X|/2\)). Taking expectation over \(S\) preserves the bound.
The NFL theorem is the structural reason that the VC characterization involves a restriction on \(\mathcal{H}\). Without restricting \(\mathcal{H}\) to a class of finite VC dimension, no distribution-free learning is possible. The Fundamental Theorem can therefore be read as: “inductive bias (finite VC dimension) is both necessary and sufficient for statistical learning.”