4.2 Mistake Trees and the Littlestone Dimension
The VC dimension counts the size of the largest set that \(\mathcal{H}\) can shatter. The Littlestone dimension counts the depth of the largest tree that \(\mathcal{H}\) can shatter. This shift, from sets to trees, is what captures the adversary’s adaptive power.
Lean: LTree.isShattered
A mistake tree for \(\mathcal{H}\) over \(X\) is a complete binary tree \(T\) in which:
- Each internal node \(v\) is labeled with an instance \(x_v \in X\).
- The left child of \(v\) corresponds to the label \(y = 0\) and the right child to \(y = 1\).
- For every root-to-leaf path \((v_1, y_1), (v_2, y_2), \ldots , (v_d, y_d)\), there exists a hypothesis \(h \in \mathcal{H}\) consistent with all labels along the path: \(h(x_{v_i}) = y_i\) for all \(i\).
The depth of the tree is the number of internal nodes on any root-to-leaf path (all paths have the same length since the tree is complete).
The key idea is that a mistake tree encodes an adversary strategy. Starting at the root, the adversary presents \(x_{v_1}\). Whatever the learner predicts, the adversary can label \(x_{v_1}\) to make the prediction wrong and descend to the corresponding child. Realizability is maintained because every root-to-leaf path is consistent with some \(h \in \mathcal{H}\). Thus a mistake tree of depth \(d\) represents an adversary strategy that forces any learner to make at least \(d\) mistakes.
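This strategy can be sketched in a few lines of Python (an illustrative sketch; the `Node` representation and function names are hypothetical, not from the text). The adversary walks the tree, flips whatever the learner predicts, and descends to the matching child:

```python
# Illustrative sketch: a mistake tree played as an adversary strategy.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    x: int                            # instance x_v at this internal node
    left: Optional["Node"] = None     # child followed when the label is y = 0
    right: Optional["Node"] = None    # child followed when the label is y = 1

def force_mistakes(tree: Optional[Node], learner: Callable[[int], int]) -> int:
    """Descend the tree, always answering opposite to the learner's prediction.
    Every round is a mistake, so a depth-d tree forces d mistakes."""
    mistakes = 0
    node = tree
    while node is not None:
        y_hat = learner(node.x)       # learner predicts on x_v
        y = 1 - y_hat                 # adversary flips the prediction
        mistakes += 1
        node = node.right if y == 1 else node.left
    return mistakes

# Depth-2 mistake tree for thresholds h(x) = 1[x <= theta] on {1, 2, 3},
# theta in {0, 1, 2, 3}: every root-to-leaf path is consistent with some theta.
tree = Node(2, left=Node(1), right=Node(3))
assert force_mistakes(tree, lambda x: 0) == 2
assert force_mistakes(tree, lambda x: 1) == 2
```

Realizability of each path in this example is easy to check: for instance, the path \((2,1),(3,1)\) is consistent with \(\theta = 3\), and the path \((2,0),(1,1)\) with \(\theta = 1\).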
Lean: LittlestoneDim
The Littlestone dimension of a hypothesis class \(\mathcal{H}\), denoted \(\mathrm{Ldim}(\mathcal{H})\), is the maximum depth of a complete mistake tree for \(\mathcal{H}\). If arbitrarily deep mistake trees exist, \(\mathrm{Ldim}(\mathcal{H}) = \infty \).
Finite classes. If \(|\mathcal{H}| < \infty \), then \(\mathrm{Ldim}(\mathcal{H}) \leq \lfloor \log _2 |\mathcal{H}| \rfloor \), since a complete binary tree of depth \(d\) has \(2^d\) leaves and each leaf requires a distinct consistent hypothesis. This bound is tight for classes that are “richly structured enough” (e.g., all functions on a domain of size \(\log _2 |\mathcal{H}|\)).
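For a finite class over a finite domain, the Littlestone dimension can be computed by brute force from its recursive characterization: \(\mathrm{Ldim}(\mathcal{H}) \geq d+1\) iff some point \(x\) splits \(\mathcal{H}\) into two nonempty parts each of Littlestone dimension at least \(d\). The sketch below (helper name `ldim` is hypothetical, not from the text) checks the log bound and its tightness:

```python
# Illustrative brute-force computation of the Littlestone dimension.
from itertools import product
from math import floor, log2

def ldim(H):
    """Hypotheses are label tuples over a fixed finite domain.
    Ldim(H) = max over coordinates i of 1 + min(Ldim(H_i0), Ldim(H_i1)),
    where H_ib are the hypotheses labeling coordinate i with b."""
    H = list(H)
    if len(H) <= 1:
        return 0
    best = 0
    for i in range(len(H[0])):
        H0 = [h for h in H if h[i] == 0]
        H1 = [h for h in H if h[i] == 1]
        if H0 and H1:                 # both labels realizable at this point
            best = max(best, 1 + min(ldim(H0), ldim(H1)))
    return best

# All 2^3 functions on a 3-point domain: the floor-log bound is tight.
H = list(product([0, 1], repeat=3))
assert ldim(H) == 3 == floor(log2(len(H)))

# An arbitrary subclass still obeys Ldim <= floor(log2 |H'|).
assert ldim(H[:5]) <= floor(log2(5))
```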
Thresholds on \(\{ 1, \ldots , n\} \). The class \(\{ h_{\leq \theta } : \theta \in \{ 0, 1, \ldots , n\} \} \), where \(h_{\leq \theta }(x) = \mathbf{1}[x \leq \theta ]\), has \(\mathrm{Ldim}= \lfloor \log _2(n+1) \rfloor \): the adversary performs binary search on the threshold, and the finite-class bound above shows this is best possible. The mistake tree in Figure 4.2 illustrates the case \(n = 8\).
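The binary-search adversary can be written out directly (an illustrative sketch; the function name and interface are hypothetical). It maintains the set of thresholds \(\{lo, \ldots , hi\}\) still consistent with all answers given so far, which guarantees realizability:

```python
# Illustrative sketch: binary-search adversary for thresholds on {1,...,n}.
from math import floor, log2

def binary_search_adversary(n, learner):
    """Force at least floor(log2(n+1)) mistakes against any online learner.
    Invariant: every theta in {lo, ..., hi} is consistent with all answers."""
    lo, hi = 0, n                     # theta initially ranges over {0, ..., n}
    mistakes = 0
    while hi - lo >= 1:               # at least two consistent thresholds left
        x = (lo + hi + 1) // 2        # query a midpoint of the uncertainty set
        y = 1 - learner(x)            # answer opposite to the prediction
        mistakes += 1
        if y == 1:                    # h(x) = 1 means x <= theta
            lo = x
        else:                         # h(x) = 0 means theta < x
            hi = x - 1
    return mistakes

# Any learner suffers at least floor(log2(n+1)) mistakes.
for n in (1, 7, 8, 100):
    assert binary_search_adversary(n, lambda x: 0) >= floor(log2(n + 1))
```

Each query splits the \(m\) remaining thresholds into parts of sizes \(\lfloor m/2 \rfloor \) and \(\lceil m/2 \rceil \), so even a learner who always steers into the smaller part survives only \(\lfloor \log _2(n+1) \rfloor \) rounds without the interval collapsing to a single threshold.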
Thresholds on \(\mathbb {R}\). Now \(\mathrm{Ldim}= \infty \). The key point: the domain is dense, so the adversary can always find a point between any two previous thresholds. At each round, the adversary presents a point that bisects the current interval of uncertainty, forcing a mistake no matter what the learner predicts. This produces mistake trees of unbounded depth.
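The same strategy on a dense domain never runs out of query points, so the adversary can hit any target depth \(d\). A minimal sketch (hypothetical function name; thresholds taken on \([0,1]\) for concreteness):

```python
# Illustrative sketch: bisection adversary for thresholds on the reals.
def real_threshold_adversary(d, learner):
    """Force exactly d mistakes for ANY d: density of the domain guarantees
    a fresh point strictly between lo and hi at every round."""
    lo, hi = 0.0, 1.0                 # some consistent theta lies in [lo, hi)
    mistakes = 0
    for _ in range(d):
        x = (lo + hi) / 2             # bisect the interval of uncertainty
        y = 1 - learner(x)            # flip the learner's prediction
        mistakes += 1
        if y == 1:                    # h(x) = 1 means x <= theta
            lo = x
        else:                         # h(x) = 0 means theta < x
            hi = x
    return mistakes

assert real_threshold_adversary(30, lambda x: 0) == 30
```

Since \(d\) is arbitrary, this yields mistake trees of every depth, i.e., \(\mathrm{Ldim}= \infty \).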
Intervals on \(\mathbb {R}\). The class \(\{ x \mapsto \mathbf{1}[a \leq x \leq b] : a, b \in \mathbb {R}\} \) also has \(\mathrm{Ldim}= \infty \). The adversary can adaptively narrow down both endpoints.
Linear classifiers in \(\mathbb {R}^d\). Halfspaces \(\{ x \mapsto \mathbf{1}[\langle w, x \rangle + b \geq 0] : w \in \mathbb {R}^d, b \in \mathbb {R}\} \) have \(\mathrm{VCdim}= d + 1\) but \(\mathrm{Ldim}= \infty \): restricting the class to a single line in \(\mathbb {R}^d\) recovers the thresholds on \(\mathbb {R}\), so the same bisection adversary applies.
Shattering a set \(\{ x_1, \ldots , x_d\} \) means \(\mathcal{H}\) can produce all \(2^d\) labelings of these fixed points. Shattering a tree of depth \(d\) means \(\mathcal{H}\) can produce a consistent labeling along every root-to-leaf path, but the points themselves may depend on previous labels. The tree structure captures adaptivity: the adversary’s choice at depth \(k\) depends on the learner’s responses at depths \(1, \ldots , k-1\). A shattered set is a special case of a mistake tree (one in which every node at the same depth is labeled with the same instance), so \(\mathrm{VCdim}(\mathcal{H}) \leq \mathrm{Ldim}(\mathcal{H})\) always holds. The converse fails spectacularly: thresholds on \(\mathbb {R}\) have \(\mathrm{VCdim}= 1\) but \(\mathrm{Ldim}= \infty \).
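The gap can be observed concretely even on a finite domain, where both dimensions are computable by brute force (illustrative helper names, not from the text): thresholds on \(\{ 1, \ldots , n\} \) have \(\mathrm{VCdim}= 1\) for every \(n\), while \(\mathrm{Ldim}= \lfloor \log _2(n+1) \rfloor \) grows without bound.

```python
# Illustrative brute-force comparison of VC and Littlestone dimensions.
from itertools import combinations

def vcdim(H):
    """Largest k such that some k-subset of coordinates is shattered."""
    n = len(H[0])
    best = 0
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            if len({tuple(h[i] for i in S) for h in H}) == 2 ** k:
                best = k
                break
    return best

def ldim(H):
    """Ldim(H) = max_i 1 + min(Ldim(H_i0), Ldim(H_i1)) over splitting coords."""
    H = list(H)
    if len(H) <= 1:
        return 0
    best = 0
    for i in range(len(H[0])):
        H0 = [h for h in H if h[i] == 0]
        H1 = [h for h in H if h[i] == 1]
        if H0 and H1:
            best = max(best, 1 + min(ldim(H0), ldim(H1)))
    return best

# Thresholds h(x) = 1[x <= theta] on {1,...,7}, theta in {0,...,7}.
n = 7
H = [tuple(1 if x <= t else 0 for x in range(1, n + 1)) for t in range(n + 1)]
assert vcdim(H) == 1        # no pair {x_i < x_j} admits the labeling (0, 1)
assert ldim(H) == 3         # floor(log2(n+1)): binary search on theta
```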