4.6 The PAC–Online Gap
The VC dimension and the Littlestone dimension both measure the complexity of a hypothesis class, but they measure different things. How do they relate?
Lean: BranchWiseLittlestoneDim_ge_VCDim
For any hypothesis class \(\mathcal{H}\), \(\mathrm{VCdim}(\mathcal{H}) \leq \mathrm{Ldim}(\mathcal{H})\).
Let \(d = \mathrm{VCdim}(\mathcal{H})\) and let \(S = \{ x_1, \ldots , x_d\} \) be a set shattered by \(\mathcal{H}\). We construct a complete mistake tree of depth \(d\). The root is labeled \(x_1\), and every node at depth \(k\) is labeled \(x_{k+1}\) (the same instance at every node of a given depth). For any root-to-leaf path with labels \((y_1, \ldots , y_d)\), shattering guarantees that some \(h \in \mathcal{H}\) satisfies \(h(x_i) = y_i\) for all \(i\). Thus this is a valid mistake tree of depth \(d\), so \(\mathrm{Ldim}(\mathcal{H}) \geq d = \mathrm{VCdim}(\mathcal{H})\).
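The construction can be checked mechanically on a small example. The Python sketch below (our own illustration, not part of the formalization) uses a hypothetical finite class that shatters three points and verifies that every root-to-leaf label sequence of the proof's tree is realized by some hypothesis:

```python
from itertools import product

# A class that shatters S = (0, 1, 2): all 8 labelings of the three points,
# each hypothesis represented as a dict from instance to label.
S = (0, 1, 2)
hypotheses = [dict(zip(S, labels)) for labels in product([0, 1], repeat=len(S))]

def tree_is_valid(S, hypotheses):
    """The proof's tree labels every depth-k node with x_{k+1}, so a
    root-to-leaf path is just a label vector (y_1, ..., y_d).  The tree is a
    valid mistake tree iff every such vector is realized by some hypothesis,
    which is exactly the statement that S is shattered."""
    return all(
        any(all(h[x] == y for x, y in zip(S, path)) for h in hypotheses)
        for path in product([0, 1], repeat=len(S))
    )

print(tree_is_valid(S, hypotheses))  # True: Ldim >= 3 for this class
```

Dropping hypotheses from the class breaks shattering, and the same check then reports that some path is unrealized.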
The inequality \(\mathrm{VCdim}\leq \mathrm{Ldim}\) is often tight (e.g., for halfspaces, where both equal the ambient dimension). But the gap can be infinite.
Lean: pac_not_implies_online
Let \(\mathcal{H}_{\mathrm{thr}} = \{ x \mapsto \mathbf{1}[x \geq \theta ] : \theta \in \mathbb {R}\} \). Then \(\mathrm{VCdim}(\mathcal{H}_{\mathrm{thr}}) = 1\) and \(\mathrm{Ldim}(\mathcal{H}_{\mathrm{thr}}) = \infty \).
\(\mathrm{VCdim}= 1\): Any single point \(x\) is shattered (take thresholds \(\theta {\lt} x\) and \(\theta {\gt} x\)). No two points \(x_1 {\lt} x_2\) are shattered, because \(h(x_1) = 1, h(x_2) = 0\) requires \(x_1 \geq \theta {\gt} x_2\), which contradicts \(x_1 {\lt} x_2\).
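Restricting to a finite grid makes the VC computation checkable by enumeration. The sketch below (a hypothetical grid of integer thresholds; names are ours) confirms that any single point is shattered while no pair is:

```python
from itertools import product

# Thresholds on a finite grid: h_theta(x) = 1 iff x >= theta.
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in range(0, 6)]

def shattered(points, hypotheses):
    """True iff every labeling of `points` is realized by some hypothesis."""
    return all(
        any(tuple(h(x) for x in points) == labels for h in hypotheses)
        for labels in product([0, 1], repeat=len(points))
    )

print(shattered((3,), thresholds))    # True: any single point is shattered
print(shattered((2, 4), thresholds))  # False: the labeling (1, 0) would need
                                      # 2 >= theta > 4, which is impossible
```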
\(\mathrm{Ldim}= \infty \): We construct mistake trees of arbitrary depth. For any \(d\), restrict the domain to \(\{ 1, 2, \ldots , 2^d\} \). The adversary performs binary search: at depth \(k\), it presents the midpoint of the current interval. Whatever the learner predicts, the adversary labels to force a mistake and halves the interval. After \(d\) rounds, the resulting root-to-leaf path is realized by a threshold at the boundary of the final interval, so the tree is a valid mistake tree of depth \(d\). Since \(d\) was arbitrary, \(\mathrm{Ldim}= \infty \).
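The adversary's binary search can be simulated directly. The Python sketch below (illustrative; the function name and the always-predict-1 learner are our own) plays the online game on \(\{1, \ldots, 2^d\}\) against an arbitrary deterministic learner and forces \(d\) mistakes:

```python
def force_mistakes(learner, d):
    """Adversary for H_thr on {1, ..., 2**d}: binary search on the threshold.

    `learner` maps (history, instance) to a prediction in {0, 1}.
    Returns the number of forced mistakes (always d) and a threshold
    consistent with every label the adversary answered."""
    lo, hi = 1, 2 ** d           # threshold known to lie in [lo, hi]
    history, mistakes = [], 0
    for _ in range(d):
        x = (lo + hi) // 2       # midpoint of the current interval
        pred = learner(history, x)
        y = 1 - pred             # answer the opposite label: forced mistake
        mistakes += 1
        history.append((x, y))
        if y == 1:               # h(x) = 1 means x >= theta, so theta <= x
            hi = x
        else:                    # h(x) = 0 means x < theta, so theta >= x + 1
            lo = x + 1
    return mistakes, lo          # theta = lo realizes every answered label

# Any learner suffers d mistakes, e.g. one that always predicts 1:
print(force_mistakes(lambda history, x: 1, 5))  # (5, 32)
```

Each interval has size exactly \(2^{d-k}\) after \(k\) rounds, so the midpoint query is well defined for all \(d\) rounds, matching the proof's depth-\(d\) tree.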
This separation is the most important non-implication in the concept graph. It witnesses the edge \(\textsf{\small pac\_learning} \xrightarrow {\texttt{\small does\_not\_imply}} \textsf{\small online\_learning}\): a class can be PAC learnable (finite VC dimension) yet not online learnable (infinite Littlestone dimension). The converse implication does hold: \(\mathrm{Ldim}{\lt} \infty \) implies \(\mathrm{VCdim}{\lt} \infty \) (by \(\mathrm{VCdim}\leq \mathrm{Ldim}\)), so online learnability implies PAC learnability.
PAC \(\not\Rightarrow \) Online. Witness: \(\mathcal{H}_{\mathrm{thr}}\) on \(\mathbb {R}\). The adversary in the online game can exploit the order structure of \(\mathbb {R}\) by performing binary search on the threshold. A random sample from a fixed distribution cannot exploit this ordering: with high probability, the sample pins down the threshold’s location to within \(\varepsilon \). The gap between PAC and online learnability is not a technicality: it reflects a fundamental difference between random and adversarial data.
In PAC learning, the distribution \(D\) is fixed before the game begins; the learner faces a static environment. In online learning, the adversary adapts to the learner’s behavior. Adaptivity is the source of the gap. A threshold on \(\mathbb {R}\) is easy to learn from random data (\(O(1/\varepsilon )\) samples suffice) but impossible to learn with a bounded number of mistakes from adversarial data (the adversary can always present a point that bisects the remaining uncertainty). The Littlestone dimension measures the depth of the adversary’s adaptive search, which can exceed the VC dimension’s non-adaptive combinatorics by an arbitrary amount. A full treatment of this and related separations appears in Chapter 6.
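The contrast can be made concrete. The sketch below (an assumed setup: uniform distribution on \([0,1]\), a hypothetical target threshold, and ERM that returns the least positive example) learns the same class from random data with a single batch of samples:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def erm_threshold(sample):
    """ERM for thresholds: the smallest consistent theta, i.e. the least
    positively-labeled example (or +infinity if there is none).
    `sample` is a list of (x, label) pairs."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")

theta_true = 0.37  # hypothetical target threshold

def label(x):
    return 1 if x >= theta_true else 0

m = 200  # roughly O(1/eps) samples for error eps ~ 1/m
sample = [(x, label(x)) for x in (random.random() for _ in range(m))]
theta_hat = erm_threshold(sample)

# Under the uniform distribution, the generalization error is the mass of
# the interval on which the two thresholds disagree: |theta_hat - theta_true|.
err = abs(theta_hat - theta_true)
print(err)  # typically on the order of 1/m
```

A static distribution cannot place each query at the midpoint of the learner's remaining uncertainty, which is exactly the power the online adversary uses above.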