3.5.2 The No-Free-Lunch Theorem (Full Proof)
The companion textbook’s Chapter 1 (The Objects of Learning) gave the statement and a brief argument. Here we give the full proof via an averaging argument over all target functions.
Let \(|X| \geq 2m\). For any learning algorithm \(A\),
\[
\max_{c \in \{0,1\}^X} \; \mathbb{E}_{S \sim D_c^m}\big[R_{D_c}(A(S))\big] \;\geq\; \frac{1}{4},
\]
where \(D_c\) is the uniform distribution on \(X\) with target \(c\).
Instead of proving the bound for the \(\max\) directly, we prove the stronger statement with \(\mathbb{E}_c\) (expectation over a uniformly random target), which implies that the \(\max\) is at least as large.
Fix any algorithm \(A\). Let \(S = (x_1, \ldots, x_m)\) be drawn uniformly from \(X\) (i.i.d.), and let \(c\) be drawn uniformly from \(\{0,1\}^X\). Write \(T = X \setminus \{x_1, \ldots, x_m\}\) for the unseen points.
Key observation. Conditioned on \(S\) and on the labels \((c(x_1), \ldots, c(x_m))\), the labels \(\{c(x) : x \in T\}\) are still independent uniform bits. This is because the prior on \(c\) is a product measure: each label \(c(x)\) is an independent fair coin, so conditioning on the labels of the sampled points leaves the remaining coordinates untouched.
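This conditional-uniformity fact can be checked by exhaustive enumeration on a toy domain. The following is a minimal sketch; the domain size, observed points, and observed labels below are arbitrary illustrative choices of ours, not part of the proof.

```python
import itertools

# Toy domain X = {0,...,5}; condition on the labels of three observed points.
X = list(range(6))
seen = [0, 1, 2]
unseen = [x for x in X if x not in seen]
observed_labels = (1, 0, 1)  # the conditioned values c(0), c(1), c(2)

# Enumerate all 2^|X| targets c: X -> {0,1}; keep those consistent with
# the observed labels (the conditional distribution is uniform over these).
consistent = [c for c in itertools.product([0, 1], repeat=len(X))
              if tuple(c[x] for x in seen) == observed_labels]

# Among consistent targets, each unseen label is 1 exactly half the time.
for x in unseen:
    ones = sum(c[x] for c in consistent)
    print(x, ones / len(consistent))  # -> 0.5 for every unseen point
```

Because the prior is a product measure, conditioning removes exactly the observed coordinates: \(2^{|X| - m}\) targets remain consistent, split evenly on every unseen coordinate.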
Therefore, for each \(x \in T\), regardless of what \(A(S)\) predicts:
\[
\Pr_{c}\big[\, A(S)(x) \neq c(x) \;\big|\; S,\; c(x_1), \ldots, c(x_m) \,\big] \;=\; \frac{1}{2}.
\]
The expected risk on unseen points is:
\[
\mathbb{E}_c\big[R_{D_c}(A(S))\big] \;\geq\; \frac{1}{|X|} \sum_{x \in T} \Pr_c\big[A(S)(x) \neq c(x)\big] \;=\; \frac{|T|}{2|X|} \;\geq\; \frac{1}{4},
\]
using \(|T| \geq |X|/2\) (since \(m \leq |X|/2\)). Taking expectation over \(S\) preserves the bound.
The NFL theorem is the structural reason that the VC characterization involves a restriction on \(\mathcal{H}\). Without restricting \(\mathcal{H}\) to a class of finite VC dimension, no distribution-free learning is possible. The Fundamental Theorem can therefore be read as: “inductive bias (finite VC dimension) is both necessary and sufficient for statistical learning.”