3.1 The PAC Framework
Two assumptions frame the discussion.
Lean: BatchLearner
Learning is realizable if the target concept \(c^*\) lies in the hypothesis class: \(c^* \in \mathcal{H}\).
Lean: BatchLearner
Learning is agnostic if no assumption is made on the relationship between \(c^*\) and \(\mathcal{H}\). The learner competes with the best hypothesis \(h^* = \arg \min _{h \in \mathcal{H}} R_D(h)\).
The realizable setting is a special case of the agnostic one (take \(h^* = c^*\), achieving zero risk). The agnostic setting is the one that matters in practice, but the realizable setting is where the cleanest characterization lives. We begin there.
Lean: PACLearnable
A hypothesis class \(\mathcal{H}\) over domain \(X\) is PAC learnable if there exists an algorithm \(A\) and a function \(m_\mathcal {H} \colon (0,1)^2 \to \mathbb {N}\) such that, for every \(\varepsilon , \delta \in (0,1)\) and every distribution \(D\) on \(X\):
If \(c^* \in \mathcal{H}\) and \(S \sim D^m\) with \(m \geq m_\mathcal {H}(\varepsilon , \delta )\), then with probability at least \(1 - \delta \) over the draw of \(S\), the output satisfies \(R_D(A(S)) \leq \varepsilon \).
The function \(m_\mathcal {H}(\varepsilon , \delta )\) is the sample complexity of \(\mathcal{H}\). When \(m_\mathcal {H}\) is polynomial in \(1/\varepsilon \) and \(1/\delta \), the class is efficiently PAC learnable in the information-theoretic sense (sample count); computational efficiency, i.e.\ polynomial runtime of \(A\), is a separate requirement.
Before unpacking the definition, we illustrate the full PAC cycle on a concrete class where every step is visible.
Let \(\mathcal{H}\) be the class of axis-aligned rectangles in \(\mathbb {R}^2\): \(h_{(a_1,a_2,b_1,b_2)}(x) = \mathbf{1}[a_1 \leq x_1 \leq b_1 \text{ and } a_2 \leq x_2 \leq b_2]\).
VC dimension. \(\mathrm{VCdim}(\mathcal{H}) = 4\). Four points, one near each side of a rectangle, can be shattered: for each subset \(T\) of the four points, the rectangle that tightly encloses exactly \(T\) labels precisely \(T\) as positive. No five points can be shattered: given any five points in \(\mathbb {R}^2\), at least one is “interior” (not extremal in any coordinate), and the labeling that excludes only that point cannot be realized by a rectangle.
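The shattering argument can be checked mechanically. The sketch below (illustrative, not part of the formal development) tests a four-point "diamond" configuration, using the fact that if any rectangle cuts out exactly the subset \(T\), then so does the tightest rectangle enclosing \(T\):

```python
from itertools import combinations

def tightest_rect(points):
    """Smallest axis-aligned rectangle enclosing `points` (None if empty)."""
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), max(xs), min(ys), max(ys))

def in_rect(rect, p):
    if rect is None:
        return False
    a1, b1, a2, b2 = rect
    return a1 <= p[0] <= b1 and a2 <= p[1] <= b2

def shattered(points):
    """Check every subset T is realized by some rectangle.

    It suffices to test the tightest rectangle around T: any rectangle
    containing T contains its bounding box, so if the tightest rectangle
    captures a point outside T, every candidate does.
    """
    for r in range(len(points) + 1):
        for T in combinations(points, r):
            rect = tightest_rect(list(T))
            realized = {p for p in points if in_rect(rect, p)}
            if realized != set(T):
                return False
    return True

# Four points, one extremal in each coordinate direction.
diamond = [(0, 1), (2, 1), (1, 0), (1, 2)]
print(shattered(diamond))            # True: VCdim >= 4

# A fifth, non-extremal point breaks shattering: the labeling that
# excludes only the interior point (1, 1) cannot be realized.
print(shattered(diamond + [(1, 1)])) # False
```

The five-point check illustrates the "interior point" argument for this one configuration; the text's argument covers arbitrary five-point sets.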
ERM and the error region. Given sample \(S\) consistent with target rectangle \(R^*\), the ERM learner returns the tightest enclosing rectangle \(R_S\) of the positive examples. Since all positive examples lie in \(R^*\), we have \(R_S \subseteq R^*\): zero false positives. The error region \(R^* \setminus R_S\) consists of four strips, one per side of \(R^*\), each containing no positive sample.
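The ERM learner itself is a few lines. The following sketch uses a hypothetical target \(R^* = [0.2, 0.8] \times [0.3, 0.7]\) and the uniform distribution on the unit square; both are illustrative choices, since the guarantee is distribution-free:

```python
import random

def erm_rectangle(sample):
    """Realizable ERM for axis-aligned rectangles: return the tightest
    rectangle enclosing the positively labeled examples."""
    pos = [x for x, y in sample if y == 1]
    if not pos:
        return None  # empty hypothesis: label everything negative
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, p):
    if rect is None:
        return 0
    a1, b1, a2, b2 = rect
    return int(a1 <= p[0] <= b1 and a2 <= p[1] <= b2)

# Hypothetical target R* = [0.2, 0.8] x [0.3, 0.7], uniform D on [0,1]^2.
def target(p):
    return int(0.2 <= p[0] <= 0.8 and 0.3 <= p[1] <= 0.7)

random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
S = [(p, target(p)) for p in points]
R_S = erm_rectangle(S)

# R_S is nested inside R* (no false positives), and it is consistent
# with every training example: positives lie in R_S by construction,
# negatives lie outside R* and hence outside R_S.
assert all(predict(R_S, x) == y for x, y in S)
```

The final assertion is exactly the \(R_S \subseteq R^*\) nesting property from the text, specialized to the training sample.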
Sample complexity. For each side \(j\) (\(j = 1, \ldots , 4\)), let \(T_j\) be the strip of \(R^*\) closest to side \(j\) with probability mass exactly \(\varepsilon /4\) under \(D\). If a positive example falls in \(T_j\), then \(R_S\) extends into \(T_j\) and that strip’s contribution to the error drops below \(\varepsilon /4\); hence if all four strips are hit, \(R_D(R_S) < \varepsilon \), so error above \(\varepsilon \) requires some strip to be missed entirely. The probability that strip \(T_j\) receives none of the \(m\) i.i.d. samples is at most \((1 - \varepsilon /4)^m \leq e^{-m\varepsilon /4}\). A union bound over the four strips gives
\[
\Pr _{S \sim D^m}\bigl [R_D(R_S) > \varepsilon \bigr ] \;\leq \; 4\,(1 - \varepsilon /4)^m \;\leq \; 4\,e^{-m\varepsilon /4}.
\]
Setting the right-hand side \(\leq \delta \) shows that \(m \geq \frac{4}{\varepsilon }\ln \frac{4}{\delta } = O\! \bigl(\frac{d + \log (1/\delta )}{\varepsilon }\bigr)\) samples suffice, with \(d = 4\), matching the Fundamental Theorem’s prediction with the tight \(1/\varepsilon \) dependence.
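Plugging in concrete accuracy and confidence parameters makes the bound tangible; this small sketch just evaluates \(m = \lceil \frac{4}{\varepsilon }\ln \frac{4}{\delta }\rceil \):

```python
import math

def sample_complexity(eps, delta):
    """Smallest integer m with 4 * exp(-m * eps / 4) <= delta,
    i.e. m = ceil((4 / eps) * ln(4 / delta))."""
    return math.ceil(4 / eps * math.log(4 / delta))

# Error 0.1 with confidence 0.95: a couple hundred samples suffice.
print(sample_complexity(0.1, 0.05))   # → 176
# Tightening delta is cheap (logarithmic in 1/delta) ...
print(sample_complexity(0.1, 0.0005))
# ... while tightening eps costs linearly in 1/eps.
print(sample_complexity(0.01, 0.05))
```

The two follow-up calls illustrate the asymmetry emphasized below: a hundredfold tighter \(\delta \) roughly doubles \(m\), while a tenfold tighter \(\varepsilon \) multiplies \(m\) by ten.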
The analysis makes visible what the general proof obscures: the error region has geometric structure (four strips), the union bound exploits the number of strips (controlled by VC dimension), and the exponential decay in \(m\) comes from the i.i.d. assumption applied to each strip independently. Replacing \(\mathbb {R}^2\) with \(\mathbb {R}^k\) gives rectangles with \(\mathrm{VCdim}= 2k\) and \(2k\) strips, and the same argument yields \(m = O(k/\varepsilon \cdot \log (k/\delta ))\).
Three features of this definition deserve emphasis:
Distribution-free. The guarantee holds for every distribution \(D\). The learner does not know \(D\) and cannot assume anything about it.
Approximate. The learner need not find \(c^*\) exactly; error \(\leq \varepsilon \) suffices.
Probably. The guarantee is probabilistic: it may fail with probability \(\delta \), but \(\delta \) can be driven arbitrarily small by drawing more samples.
The \(\delta \) parameter is structurally unimportant for characterization purposes: a bound with confidence \(1 - \delta \) can always be converted to one with confidence \(1 - \delta '\) by replacing \(m\) with \(m \cdot \lceil \log (1/\delta ')/\log (1/\delta )\rceil \) and taking a majority vote. Consequently, the sample complexity depends on \(\log (1/\delta )\), not on \(1/\delta \), and the VC characterization is insensitive to how \(\delta \) enters.
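The amplification effect can be seen in a deliberately simplified simulation that treats each run of a weak-confidence learner as an independent coin landing "success" with probability \(1 - \delta \) (the names and parameters below are illustrative, not the exact scheme in the text):

```python
import random

def majority_boost(run_once, k):
    """Aggregate k independent binary runs by majority vote."""
    votes = sum(run_once() for _ in range(k))
    return int(2 * votes > k)

# A base "learner" that succeeds with probability 1 - delta = 0.7.
random.seed(1)
delta = 0.3
base = lambda: int(random.random() < 1 - delta)

trials = 10_000
failures = sum(1 - majority_boost(base, 15) for _ in range(trials))
print(failures / trials)  # well below the base failure rate delta = 0.3
```

Since each run fails with probability \(\delta < 1/2\), the majority over \(k\) runs fails only when at least half the runs fail, an event whose probability decays exponentially in \(k\); this is why the sample complexity picks up only a \(\log (1/\delta )\) factor.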
Lean: BatchLearner
Given sample \(S = \{ (x_i, c^*(x_i))\} _{i=1}^m\) and hypothesis class \(\mathcal{H}\), the empirical risk minimizer is
\[
\operatorname{ERM}_{\mathcal{H}}(S) \in \arg \min _{h \in \mathcal{H}} \hat{R}_S(h), \qquad \hat{R}_S(h) = \frac{1}{m} \sum _{i=1}^m \mathbf{1}[h(x_i) \neq c^*(x_i)].
\]
In the realizable setting, \(\operatorname{ERM}\) returns any \(h \in \mathcal{H}\) consistent with \(S\) (since \(\hat{R}_S(c^*) = 0\)).
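For a finite class, ERM is literally a minimum over \(\mathcal{H}\). The sketch below runs it on a toy class of threshold functions over \(\{0, \ldots, 9\}\) (an illustrative class, not one discussed in the text):

```python
def empirical_risk(h, S):
    """Fraction of sample points that h mislabels."""
    return sum(h(x) != y for x, y in S) / len(S)

def erm(H, S):
    """Return a hypothesis in H minimizing empirical risk on S.

    In the realizable setting the minimum is 0, so this returns
    some h consistent with every labeled example.
    """
    return min(H, key=lambda h: empirical_risk(h, S))

# Toy finite class: thresholds h_t(x) = 1[x >= t] for t = 0, ..., 10.
H = [lambda x, t=t: int(x >= t) for t in range(11)]
target = lambda x: int(x >= 4)

S = [(x, target(x)) for x in [1, 3, 4, 7, 9]]
h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # 0.0 in the realizable setting
```

Note that `erm` returns *a* minimizer, not a canonical one; in the realizable setting any consistent hypothesis is acceptable, which is exactly the freedom the definition above allows.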
The question this chapter answers: for which \(\mathcal{H}\) does ERM succeed, and when is PAC learning possible at all?