4.6 The PAC–Online Gap
The VC dimension and the Littlestone dimension both measure the complexity of a hypothesis class, but they measure different things. How do they relate?
Lean: BranchWiseLittlestoneDim_ge_VCDim
For any hypothesis class \(\mathcal{H}\), \(\mathrm{VCdim}(\mathcal{H}) \leq \mathrm{Ldim}(\mathcal{H})\).
Let \(d = \mathrm{VCdim}(\mathcal{H})\) and let \(S = \{ x_1, \ldots , x_d\} \) be a set shattered by \(\mathcal{H}\). We construct a complete mistake tree of depth \(d\). The root is labeled \(x_1\), and every node at depth \(k\) is labeled \(x_{k+1}\) (the same instance at every node of a given depth). For any root-to-leaf path with labels \((y_1, \ldots , y_d)\), shattering guarantees that some \(h \in \mathcal{H}\) satisfies \(h(x_i) = y_i\) for all \(i\). Thus this is a valid mistake tree of depth \(d\), so \(\mathrm{Ldim}(\mathcal{H}) \geq d = \mathrm{VCdim}(\mathcal{H})\).
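The construction can be checked mechanically on a small example. The Python sketch below (our own illustration, not part of the formalization) uses a hypothetical finite class that shatters three points and verifies that every root-to-leaf label sequence of the proof's tree is realized by some hypothesis:

```python
from itertools import product

# A class that shatters S = (0, 1, 2): all 8 labelings of the three points,
# each hypothesis represented as a dict from instance to label.
S = (0, 1, 2)
hypotheses = [dict(zip(S, labels)) for labels in product([0, 1], repeat=len(S))]

def tree_is_valid(S, hypotheses):
    """The proof's tree labels every depth-k node with x_{k+1}, so a
    root-to-leaf path is just a label vector (y_1, ..., y_d).  The tree is a
    valid mistake tree iff every such vector is realized by some hypothesis,
    which is exactly the statement that S is shattered."""
    return all(
        any(all(h[x] == y for x, y in zip(S, path)) for h in hypotheses)
        for path in product([0, 1], repeat=len(S))
    )

print(tree_is_valid(S, hypotheses))  # True: Ldim >= 3 for this class
```

Dropping hypotheses from the class breaks shattering, and the same check then reports that some path is unrealized.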
The inequality \(\mathrm{VCdim}\leq \mathrm{Ldim}\) is often tight (e.g., for halfspaces, where both equal the ambient dimension). But the gap can be infinite.
Lean: pac_not_implies_online
Let \(\mathcal{H}_{\mathrm{thr}} = \{ x \mapsto \mathbf{1}[x \geq \theta ] : \theta \in \mathbb {R}\} \). Then \(\mathrm{VCdim}(\mathcal{H}_{\mathrm{thr}}) = 1\) and \(\mathrm{Ldim}(\mathcal{H}_{\mathrm{thr}}) = \infty \).
\(\mathrm{VCdim}= 1\): Any single point \(x\) is shattered (take thresholds \(\theta {\lt} x\) and \(\theta {\gt} x\)). No two points \(x_1 {\lt} x_2\) are shattered, because \(h(x_1) = 1, h(x_2) = 0\) requires \(x_1 \geq \theta {\gt} x_2\), which contradicts \(x_1 {\lt} x_2\).
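Restricting to a finite grid makes the VC computation checkable by enumeration. The sketch below (a hypothetical grid of integer thresholds; names are ours) confirms that any single point is shattered while no pair is:

```python
from itertools import product

# Thresholds on a finite grid: h_theta(x) = 1 iff x >= theta.
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in range(0, 6)]

def shattered(points, hypotheses):
    """True iff every labeling of `points` is realized by some hypothesis."""
    return all(
        any(tuple(h(x) for x in points) == labels for h in hypotheses)
        for labels in product([0, 1], repeat=len(points))
    )

print(shattered((3,), thresholds))    # True: any single point is shattered
print(shattered((2, 4), thresholds))  # False: the labeling (1, 0) would need
                                      # 2 >= theta > 4, which is impossible
```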
\(\mathrm{Ldim}= \infty \): We construct mistake trees of arbitrary depth. For any \(d\), restrict the domain to \(\{ 1, 2, \ldots , 2^d\} \). The adversary performs binary search: at depth \(k\), it presents the midpoint of the current interval. Whatever the learner predicts, the adversary labels to force a mistake and halves the interval. After \(d\) rounds, the resulting root-to-leaf path is realized by a threshold at the boundary of the final interval, so the tree is a valid mistake tree of depth \(d\). Since \(d\) was arbitrary, \(\mathrm{Ldim}= \infty \).
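The adversary's binary search can be simulated directly. The Python sketch below (illustrative; the function name and the always-predict-1 learner are our own) plays the online game on \(\{1, \ldots, 2^d\}\) against an arbitrary deterministic learner and forces \(d\) mistakes:

```python
def force_mistakes(learner, d):
    """Adversary for H_thr on {1, ..., 2**d}: binary search on the threshold.

    `learner` maps (history, instance) to a prediction in {0, 1}.
    Returns the number of forced mistakes (always d) and a threshold
    consistent with every label the adversary answered."""
    lo, hi = 1, 2 ** d           # threshold known to lie in [lo, hi]
    history, mistakes = [], 0
    for _ in range(d):
        x = (lo + hi) // 2       # midpoint of the current interval
        pred = learner(history, x)
        y = 1 - pred             # answer the opposite label: forced mistake
        mistakes += 1
        history.append((x, y))
        if y == 1:               # h(x) = 1 means x >= theta, so theta <= x
            hi = x
        else:                    # h(x) = 0 means x < theta, so theta >= x + 1
            lo = x + 1
    return mistakes, lo          # theta = lo realizes every answered label

# Any learner suffers d mistakes, e.g. one that always predicts 1:
print(force_mistakes(lambda history, x: 1, 5))  # (5, 32)
```

Each interval has size exactly \(2^{d-k}\) after \(k\) rounds, so the midpoint query is well defined for all \(d\) rounds, matching the proof's depth-\(d\) tree.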
This separation is the most important non-implication in the concept graph. It witnesses the edge \(\textsf{\small pac\_learning} \xrightarrow {\texttt{\small does\_not\_imply}} \textsf{\small online\_learning}\): a class can be PAC learnable (finite VC dimension) yet not online learnable (infinite Littlestone dimension). The converse implication does hold: \(\mathrm{Ldim}{\lt} \infty \) implies \(\mathrm{VCdim}{\lt} \infty \) (by \(\mathrm{VCdim}\leq \mathrm{Ldim}\)), so online learnability implies PAC learnability.
PAC \(\not\Rightarrow \) Online. Witness: \(\mathcal{H}_{\mathrm{thr}}\) on \(\mathbb {R}\). The adversary in the online game can exploit the order structure of \(\mathbb {R}\) by performing binary search on the threshold. A random sample from a fixed distribution cannot exploit this ordering: with high probability, the sample pins down the threshold’s location to within \(\varepsilon \). The gap between PAC and online learnability is not a technicality: it reflects a fundamental difference between random and adversarial data.
In PAC learning, the distribution \(D\) is fixed before the game begins; the learner faces a static environment. In online learning, the adversary adapts to the learner’s behavior. Adaptivity is the source of the gap. A threshold on \(\mathbb {R}\) is easy to learn from random data (\(O(1/\varepsilon )\) samples suffice) but impossible to learn with a bounded number of mistakes from adversarial data (the adversary can always present a point that bisects the remaining uncertainty). The Littlestone dimension measures the depth of the adversary’s adaptive search, which can exceed the VC dimension’s non-adaptive combinatorics by an arbitrary amount. A full treatment of this and related separations appears in Chapter 6.
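The contrast can be made concrete. The sketch below (an assumed setup: uniform distribution on \([0,1]\), a hypothetical target threshold, and ERM that returns the least positive example) learns the same class from random data with a single batch of samples:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def erm_threshold(sample):
    """ERM for thresholds: the smallest consistent theta, i.e. the least
    positively-labeled example (or +infinity if there is none).
    `sample` is a list of (x, label) pairs."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")

theta_true = 0.37  # hypothetical target threshold

def label(x):
    return 1 if x >= theta_true else 0

m = 200  # roughly O(1/eps) samples for error eps ~ 1/m
sample = [(x, label(x)) for x in (random.random() for _ in range(m))]
theta_hat = erm_threshold(sample)

# Under the uniform distribution, the generalization error is the mass of
# the interval on which the two thresholds disagree: |theta_hat - theta_true|.
err = abs(theta_hat - theta_true)
print(err)  # typically on the order of 1/m
```

A static distribution cannot place each query at the midpoint of the learner's remaining uncertainty, which is exactly the power the online adversary uses above.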