Formal Learning Theory Kernel: Blueprint v1

4.2 Mistake Trees and the Littlestone Dimension

The VC dimension counts the size of the largest set that \(\mathcal{H}\) can shatter. The Littlestone dimension counts the depth of the largest tree that \(\mathcal{H}\) can shatter. This shift, from sets to trees, is what captures the adversary’s adaptive power.

Definition 4.3 Mistake Tree

Lean: LTree.isShattered

A mistake tree for \(\mathcal{H}\) over \(X\) is a complete binary tree \(T\) in which:

  • Each internal node \(v\) is labeled with an instance \(x_v \in X\).

  • The left child of \(v\) corresponds to the label \(y = 0\) and the right child to \(y = 1\).

  • For every root-to-leaf path \((v_1, y_1), (v_2, y_2), \ldots , (v_d, y_d)\), there exists a hypothesis \(h \in \mathcal{H}\) consistent with all labels along the path: \(h(x_{v_i}) = y_i\) for all \(i\).

The depth of the tree is the number of internal nodes on any root-to-leaf path (all paths have the same length since the tree is complete).

The key idea is that a mistake tree encodes an adversary strategy. Starting at the root, the adversary presents \(x_{v_1}\). Whatever the learner predicts, the adversary can label \(x_{v_1}\) to make the prediction wrong and descend to the corresponding child. Realizability is maintained because every root-to-leaf path is consistent with some \(h \in \mathcal{H}\). Thus a mistake tree of depth \(d\) represents an adversary strategy that forces any learner to make at least \(d\) mistakes.
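This adversary strategy can be made concrete. The following Python sketch (function names hypothetical) plays the binary-search adversary for the threshold class \(h_{\leq\theta}(x) = \mathbf{1}[x \leq \theta]\), \(\theta \in \{0,\ldots,n\}\), on \(\{1,\ldots,n\}\): the adversary tracks the interval of thresholds consistent with the history, presents a point that splits it, and answers the opposite of whatever the learner predicts.

```python
# Hypothetical sketch of the adversary encoded by the mistake tree for
# thresholds on {1, ..., n}.  Realizability is maintained because the
# interval of consistent thresholds stays nonempty after every answer.

def run_adversary(n, learner):
    """Count forced mistakes; `learner` maps (history, x) to a prediction."""
    lo, hi = 0, n               # consistent thresholds: {lo, ..., hi}
    mistakes, history = 0, []
    while lo < hi:              # both labels still realizable at some point
        x = (lo + hi + 1) // 2  # splitting instance, satisfies lo < x <= hi
        y = 1 - learner(list(history), x)   # contradict the learner
        if y == 1:
            lo = x              # h(x) = 1 forces theta >= x
        else:
            hi = x - 1          # h(x) = 0 forces theta <= x - 1
        mistakes += 1
        history.append((x, y))
    return mistakes

def halving(history, x, n=8):
    """Predict with the majority of still-consistent thresholds."""
    lo, hi = 0, n
    for (xi, yi) in history:
        if yi == 1: lo = xi
        else: hi = xi - 1
    return int(hi - x + 1 >= x - lo)   # #{theta >= x} vs. #{theta < x}

print(run_adversary(8, lambda h, x: 0))   # naive learner: 4 mistakes
print(run_adversary(8, halving))          # majority learner: 3 mistakes
```

Against this adversary every learner makes at least \(\lfloor \log_2(n+1) \rfloor\) mistakes (the interval of \(n+1\) consistent thresholds can at best be halved per round); a careless learner, like the constant one above, can suffer more.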

Definition 4.4 Littlestone Dimension

Lean: LittlestoneDim

The Littlestone dimension of a hypothesis class \(\mathcal{H}\), denoted \(\mathrm{Ldim}(\mathcal{H})\), is the maximum depth of a complete mistake tree for \(\mathcal{H}\). If arbitrarily deep mistake trees exist, \(\mathrm{Ldim}(\mathcal{H}) = \infty \).

\begin{tikzpicture} [
    level distance=1.6cm,
    sibling distance=1.2cm,
    every node/.style={font=\small},
    treenode/.style={draw, circle, thick, minimum size=0.7cm, fill=onlineblue!10, font=\small\bfseries},
    leafnode/.style={draw, rectangle, rounded corners, thick, minimum width=0.6cm, minimum height=0.45cm, fill=sepgreen!15, font=\scriptsize},
    edgelabel/.style={font=\scriptsize, midway},
    level 1/.style={sibling distance=5.6cm},
    level 2/.style={sibling distance=2.8cm},
    level 3/.style={sibling distance=1.4cm},
]
\node[treenode] (root) {$4$}
    child { node[treenode] {$2$}
        child { node[treenode] {$1$}
            child { node[leafnode] {$h_{\leq 0}$} edge from parent node[edgelabel, left] {$0$} }
            child { node[leafnode] {$h_{\leq 1}$} edge from parent node[edgelabel, right] {$1$} }
            edge from parent node[edgelabel, left] {$0$}
        }
        child { node[treenode] {$3$}
            child { node[leafnode] {$h_{\leq 2}$} edge from parent node[edgelabel, left] {$0$} }
            child { node[leafnode] {$h_{\leq 3}$} edge from parent node[edgelabel, right] {$1$} }
            edge from parent node[edgelabel, right] {$1$}
        }
        edge from parent node[edgelabel, left] {$0$}
    }
    child { node[treenode] {$6$}
        child { node[treenode] {$5$}
            child { node[leafnode] {$h_{\leq 4}$} edge from parent node[edgelabel, left] {$0$} }
            child { node[leafnode] {$h_{\leq 5}$} edge from parent node[edgelabel, right] {$1$} }
            edge from parent node[edgelabel, left] {$0$}
        }
        child { node[treenode] {$7$}
            child { node[leafnode] {$h_{\leq 6}$} edge from parent node[edgelabel, left] {$0$} }
            child { node[leafnode] {$h_{\leq 7}$} edge from parent node[edgelabel, right] {$1$} }
            edge from parent node[edgelabel, right] {$1$}
        }
        edge from parent node[edgelabel, right] {$1$}
    };

% Annotation
\node[right=0.3cm of root, font=\scriptsize\itshape, text=notegray, align=left, anchor=west, xshift=3.5cm] {
    Domain: $X = \{1,2,\ldots,8\}$\\[2pt]
    Class: thresholds $h_{\leq \theta}(x) = \ind[x \leq \theta]$\\[2pt]
    Each internal node presents an instance.\\[2pt]
    Each leaf witnesses a consistent $h \in \mathcal{H}$.\\[2pt]
    Depth $= 3 = \lceil \log_2 8 \rceil$.
};
\end{tikzpicture}
Figure 4.2 A complete mistake tree of depth \(3\) for the threshold class on \(\{ 1,\ldots ,8\} \). Each internal node is labeled with an instance; the left and right children correspond to labels \(0\) and \(1\). Every root-to-leaf path is consistent with the threshold hypothesis shown at the leaf. The adversary can traverse this tree from root to any leaf, forcing \(3\) mistakes.
Example 4.5 Littlestone dimension of fundamental classes

  1. Finite classes. If \(|\mathcal{H}| {\lt} \infty \), then \(\mathrm{Ldim}(\mathcal{H}) \leq \lfloor \log _2 |\mathcal{H}| \rfloor \): a complete mistake tree of depth \(d\) has \(2^d\) leaves, distinct root-to-leaf paths carry distinct labelings, and so each leaf must be witnessed by a different hypothesis. The bound is tight for sufficiently rich classes, e.g., the class of all \(2^k\) functions on a domain of size \(k\).

  2. Thresholds on \(\{ 1, \ldots , n\} \). The class \(\{ h_{\leq \theta } : \theta \in \{ 0, 1, \ldots , n\} \} \) has \(\mathrm{Ldim}= \lfloor \log _2(n+1) \rfloor \): there are \(n+1\) hypotheses, so the finite-class bound applies, and the adversary achieves it by binary search on the threshold. The mistake tree in Figure 4.2 illustrates the case \(n = 8\), where \(\lfloor \log _2 9 \rfloor = 3\).

  3. Thresholds on \(\mathbb {R}\). Now \(\mathrm{Ldim}= \infty \). The key point: the domain is dense, so the adversary can always find a point between any two previous thresholds. At each round, the adversary presents a point that bisects the current interval of uncertainty, forcing a mistake no matter what the learner predicts. This produces mistake trees of unbounded depth.

  4. Intervals on \(\mathbb {R}\). The class \(\{ x \mapsto \mathbf{1}[a \leq x \leq b] : a, b \in \mathbb {R}\} \) also has \(\mathrm{Ldim}= \infty \). The adversary can adaptively narrow down both endpoints.

  5. Halfspaces in \(\mathbb {R}^d\). The class \(\{ x \mapsto \mathbf{1}[\langle w, x \rangle \geq b] : w \in \mathbb {R}^d, b \in \mathbb {R}\} \) has \(\mathrm{VCdim}= d+1\) but \(\mathrm{Ldim}= \infty \): already for \(d = 1\) it contains the thresholds on \(\mathbb {R}\) from item 3, so arbitrarily deep mistake trees exist. Finite VC dimension therefore does not imply finite Littlestone dimension.
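Definition 4.4 hides a recursion that makes small cases checkable by machine: a depth-\(d\) mistake tree for \(\mathcal{H}\) exists iff some instance \(x\) splits \(\mathcal{H}\) into two nonempty parts (those with \(h(x)=0\) and those with \(h(x)=1\)), each admitting a depth-\((d-1)\) tree. A brute-force Python sketch (helper names hypothetical) that recovers \(\lfloor \log_2(n+1) \rfloor\) for the threshold class:

```python
import math

# Hypothetical brute-force evaluation of the Littlestone dimension: maximize,
# over splitting instances x, one plus the smaller of the two subtree depths.

def ldim(H, X):
    """Littlestone dimension of H (hypotheses as dicts x -> {0,1}) over X."""
    best = 0
    for x in X:
        H0 = [h for h in H if h[x] == 0]
        H1 = [h for h in H if h[x] == 1]
        if H0 and H1:                    # x can label an internal node
            best = max(best, 1 + min(ldim(H0, X), ldim(H1, X)))
    return best

def thresholds(n):
    """h_theta(x) = 1 iff x <= theta, for theta in {0, ..., n}."""
    return [{x: int(x <= t) for x in range(1, n + 1)} for t in range(n + 1)]

for n in [1, 2, 4, 8]:
    assert ldim(thresholds(n), range(1, n + 1)) == math.floor(math.log2(n + 1))
```

The recursion takes the *minimum* over the two branches because both subtrees must reach the same depth; the cost is exponential in the domain size, so this is only a sanity check, not an algorithm.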

Remark 4.6 Sets vs. trees

Shattering a set \(\{ x_1, \ldots , x_d\} \) means \(\mathcal{H}\) can produce all \(2^d\) labelings of these fixed points. Shattering a tree of depth \(d\) means \(\mathcal{H}\) can produce a consistent labeling along every root-to-leaf path, but the points themselves may depend on previous labels. The tree structure captures adaptivity: the adversary’s choice at depth \(k\) depends on the learner’s responses at depths \(1, \ldots , k-1\). A shattered set is a special case of a mistake tree (one in which every node at the same depth is labeled with the same instance), so \(\mathrm{VCdim}(\mathcal{H}) \leq \mathrm{Ldim}(\mathcal{H})\) always holds. The converse fails spectacularly: thresholds on \(\mathbb {R}\) have \(\mathrm{VCdim}= 1\) but \(\mathrm{Ldim}= \infty \).
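The inequality \(\mathrm{VCdim}(\mathcal{H}) \leq \mathrm{Ldim}(\mathcal{H})\), and the fact that the gap can already be strict on a finite domain, can be observed numerically. A self-contained Python sketch (helper names hypothetical) computing both dimensions for thresholds on \(\{1,\ldots,8\}\):

```python
from itertools import combinations

# Hypothetical brute-force comparison of the two notions of shattering:
# fixed sets for the VC dimension, adaptive trees for the Littlestone one.

def shatters(H, S):
    """Does H realize all 2^|S| labelings of the fixed point set S?"""
    return len({tuple(h[x] for x in S) for h in H}) == 2 ** len(S)

def vcdim(H, X):
    return max(r for r in range(len(X) + 1)
               if any(shatters(H, S) for S in combinations(X, r)))

def ldim(H, X):
    best = 0
    for x in X:
        H0 = [h for h in H if h[x] == 0]
        H1 = [h for h in H if h[x] == 1]
        if H0 and H1:
            best = max(best, 1 + min(ldim(H0, X), ldim(H1, X)))
    return best

X = list(range(1, 9))
H = [{x: int(x <= t) for x in X} for t in range(9)]   # thresholds on {1..8}
print(vcdim(H, X), ldim(H, X))   # -> 1 3
```

Thresholds cannot shatter any two fixed points (the labeling \(0\) then \(1\) in increasing order is unrealizable), so \(\mathrm{VCdim} = 1\), while the adaptive tree of Figure 4.2 gives \(\mathrm{Ldim} = 3\).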