3.4 The Agnostic Setting and the \(\varepsilon ^2\) Price
A hypothesis class \(\mathcal{H}\) is agnostic PAC learnable if there exists an algorithm \(A\) and a function \(m_\mathcal{H}(\varepsilon , \delta )\) such that for every \(\varepsilon , \delta \in (0,1)\) and every distribution \(D\) on \(X \times \{ 0,1\} \) (not necessarily generated by any \(c^* \in \mathcal{H}\)), whenever \(m \geq m_\mathcal{H}(\varepsilon , \delta )\), with probability at least \(1 - \delta \) over an i.i.d. sample \(S \sim D^m\),
\[ R_D(A(S)) \; \leq \; \min_{h \in \mathcal{H}} R_D(h) + \varepsilon . \]
The Fundamental Theorem tells us that agnostic PAC learnability is equivalent to realizable PAC learnability (1 \(\Leftrightarrow \) 2). The same classes are learnable. But the sample complexity changes, and this change is not cosmetic.
Let \(d = \mathrm{VCdim}(\mathcal{H})\).
In the realizable setting: \(m(\varepsilon , \delta ) = \Theta \! \left(\frac{d + \log (1/\delta )}{\varepsilon }\right)\).
In the agnostic setting: \(m(\varepsilon , \delta ) = \Theta \! \left(\frac{d + \log (1/\delta )}{\varepsilon ^2}\right)\).
The gap from \(1/\varepsilon \) to \(1/\varepsilon ^2\) is tight: no algorithm can achieve \(o(d/\varepsilon ^2)\) in the agnostic setting, and no algorithm needs \(\omega (d/\varepsilon )\) in the realizable setting.
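To get a feel for the magnitude of the gap, here is a quick numerical comparison of the two bounds with the \(\Theta \)-constants dropped (set to 1, which they are not; this is only an order-of-magnitude sketch):

```python
import math

def m_realizable(d, eps, delta):
    # Realizable bound with constants dropped: (d + log(1/delta)) / eps
    return math.ceil((d + math.log(1 / delta)) / eps)

def m_agnostic(d, eps, delta):
    # Agnostic bound with constants dropped: (d + log(1/delta)) / eps^2
    return math.ceil((d + math.log(1 / delta)) / eps**2)

d, delta = 10, 0.05
for eps in (0.1, 0.01):
    print(eps, m_realizable(d, eps, delta), m_agnostic(d, eps, delta))
```

Note the pattern: the agnostic bound at accuracy \(\varepsilon = 0.1\) matches the realizable bound at the ten-times-finer accuracy \(\varepsilon = 0.01\). Halving \(\varepsilon \) doubles the realizable sample size but quadruples the agnostic one.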
Where does the gap come from? It is tempting to blame the proof technique: perhaps a cleverer argument could recover \(1/\varepsilon \) in the agnostic case. It cannot. The gap is fundamental, and the reason is information-theoretic.
Consider the class of threshold functions on \([0,1]\) (VC dimension 1, so \(d = 1\) for simplicity). In the realizable case, every labeled example narrows the interval that can contain the true threshold, so roughly \(1/\varepsilon \) samples suffice to localize it to a region of probability mass \(\varepsilon \), giving \(O(1/\varepsilon )\) sample complexity.
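The realizable threshold case can be simulated directly. The sketch below (with an assumed true threshold of 0.3 and a uniform marginal, so the risk of a learned threshold \(\hat{t}\) is exactly \(|\hat{t} - t^*|\)) shows ERM's error shrinking on the order of \(1/m\):

```python
import random

def erm_threshold(sample):
    """Return a threshold consistent with a realizable labeled sample.

    sample: list of (x, y) pairs with y = 1 iff x >= the true threshold.
    Any point between the largest 0-labeled x and the smallest 1-labeled x
    is consistent; we take the midpoint.
    """
    lo = max((x for x, y in sample if y == 0), default=0.0)
    hi = min((x for x, y in sample if y == 1), default=1.0)
    return (lo + hi) / 2

random.seed(0)
t_star, m = 0.3, 1000
sample = [(x, int(x >= t_star)) for x in (random.random() for _ in range(m))]
t_hat = erm_threshold(sample)
# Under the uniform marginal the risk is |t_hat - t_star|,
# typically on the order of 1/m here.
print(abs(t_hat - t_star))
```

Each sample either raises the lower bracket or lowers the upper bracket on \(t^*\); no sample is ever "wasted," which is exactly the information structure that realizability buys.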
In the agnostic case, the distribution \(D\) on \((X, Y)\) may have a base error rate \(\eta = \min _h R_D(h) > 0\), and the learner must distinguish between two hypotheses \(h_1, h_2\) whose risks differ by \(\varepsilon \): \(R_D(h_1) = \eta \) and \(R_D(h_2) = \eta + \varepsilon \). Distinguishing these requires detecting a difference \(\varepsilon \) against a background noise level \(\eta \).
This is a hypothesis testing problem. By standard lower bounds (Le Cam’s method or Fano’s inequality), distinguishing distributions at total variation distance \(O(\varepsilon )\) from \(m\) samples requires \(m = \Omega (1/\varepsilon ^2)\). The noise “contaminates” the signal: in the realizable case, an error is always informative (it eliminates hypotheses), but in the agnostic case, each error might be noise or signal, and separating the two costs the extra \(1/\varepsilon \) factor.
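The coin-flipping version of this hypothesis test is easy to simulate. The sketch below (illustrative values \(\eta = 0.25\), \(\varepsilon = 0.05\) are assumptions, not from the text) pits the natural test, comparing the empirical mean to the midpoint \(\eta + \varepsilon /2\), against biases \(\eta \) and \(\eta + \varepsilon \):

```python
import random

def distinguish(p, m, eta, eps, rng):
    """Guess the coin's bias (eta or eta + eps) from m flips of a p-coin."""
    mean = sum(rng.random() < p for _ in range(m)) / m
    return eta + eps if mean > eta + eps / 2 else eta

def success_rate(m, eta, eps, trials=2000, seed=1):
    """Fraction of trials in which the test identifies the true bias."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        p = rng.choice([eta, eta + eps])  # nature picks one of the two biases
        ok += distinguish(p, m, eta, eps, rng) == p
    return ok / trials

eta, eps = 0.25, 0.05
print(success_rate(int(1 / eps), eta, eps))     # m = 1/eps = 20: barely above chance
print(success_rate(int(4 / eps**2), eta, eps))  # m = 4/eps^2 = 1600: reliable
```

At \(m = 1/\varepsilon \) the standard deviation of the empirical mean (about \(\sqrt{\eta (1-\eta )/m}\)) swamps the gap \(\varepsilon /2\) the test must detect; only at \(m \sim 1/\varepsilon ^2\) does the signal rise above the sampling noise, which is the \(\Omega (1/\varepsilon ^2)\) lower bound in miniature.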
The \(\varepsilon \) vs. \(\varepsilon ^2\) gap is the first structural surprise of statistical learning theory that cannot be anticipated from finite-class arguments. For finite \(|\mathcal{H}|\), both realizable and agnostic sample complexities have the same dependence on \(\varepsilon \) (up to the \(\log |\mathcal{H}|\) factor). The gap emerges only when \(\mathcal{H}\) is infinite and VC dimension replaces \(\log |\mathcal{H}|\). Realizability is not just a simplifying assumption; it provides a qualitatively different information structure.