4.4 The Lower Bound: The Adversary Strategy
The upper bound shows that the \(\operatorname{SOA}\) can limit mistakes to \(\mathrm{Ldim}(\mathcal{H})\). The lower bound shows that no learner can do better: for any learner, there exists an adversary that forces at least \(\mathrm{Ldim}(\mathcal{H})\) mistakes.
Lean: forward_direction
For any hypothesis class \(\mathcal{H}\) with \(\mathrm{Ldim}(\mathcal{H}) = d\) and any (possibly randomized) learner \(A\), there exists an adversary strategy that forces \(A\) to make at least \(d\) mistakes.
Let \(T\) be a complete mistake tree for \(\mathcal{H}\) of depth \(d\). The adversary plays \(T\) as follows.
The adversary maintains a pointer to a node of \(T\), starting at the root. At round \(t\), the adversary presents the instance \(x_{v_t}\) labeling the current node \(v_t\). The learner predicts \(\hat{y}_t\). The adversary then sets \(y_t = 1 - \hat{y}_t\) (the opposite of the learner’s prediction) and descends to the child of \(v_t\) corresponding to label \(y_t\).
This strategy has two properties:
Every round is a mistake. By construction, \(y_t = 1 - \hat{y}_t \neq \hat{y}_t\).
Realizability is maintained. After \(d\) rounds, the adversary has descended from the root to a leaf of \(T\), tracing a path \((v_1, y_1), (v_2, y_2), \ldots , (v_d, y_d)\). By the definition of a mistake tree, there exists \(h^* \in \mathcal{H}\) consistent with all labels along this path: \(h^*(x_{v_i}) = y_i\) for all \(i \in \{ 1, \ldots , d\} \). For any subsequent rounds \(t > d\), the adversary can label consistently with this \(h^*\).
The adversary forces exactly \(d\) mistakes in the first \(d\) rounds, so \(A\) makes at least \(d\) mistakes.
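The adversary's tree walk can be sketched in a few lines of Python (an illustrative sketch, not part of the formal development; the function `adversary_play` and the nested-dict tree encoding are assumptions of this sketch):

```python
def adversary_play(tree, learner):
    """Force one mistake per level of a complete mistake tree.

    `tree` is a nested dict {"x": instance, 0: subtree, 1: subtree},
    with None children below the leaves.  At each node the adversary
    shows the instance, hears the prediction, labels with the opposite,
    and descends to the matching child.  Returns the number of forced
    mistakes and the (instance, label) path; by the definition of a
    mistake tree, some h* in H realizes every label on that path.
    """
    mistakes, path, node = 0, [], tree
    while node is not None:
        x = node["x"]
        y_hat = learner(x)   # learner commits to a prediction
        y = 1 - y_hat        # adversary plays the opposite label
        mistakes += 1        # so every round is a mistake
        path.append((x, y))
        node = node[y]       # descend along the chosen label
    return mistakes, path

# A depth-2 mistake tree over instances {0, 1, 2}: any learner is
# forced to make 2 mistakes, one per level of the tree.
tree = {"x": 1,
        0: {"x": 0, 0: None, 1: None},
        1: {"x": 2, 0: None, 1: None}}
```

Running `adversary_play(tree, lambda x: 0)` forces 2 mistakes, and so does any other learner: the mistake count depends only on the depth of the tree, never on the predictions.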
For randomized learners, the argument extends: since the adversary observes the realized prediction \(\hat{y}_t\) before committing to \(y_t\), playing the opposite label forces a mistake at every round regardless of how \(\hat{y}_t\) was drawn, so randomization offers no advantage in this model. Alternatively, one can apply Yao’s minimax principle to reduce the randomized case to the deterministic one: the worst-case expected mistake count of any randomized learner is at least the value of the deterministic game, which we have shown to be at least \(d\).
Combining this lower bound with the upper bound of ??, we obtain the fundamental characterization.
Lean: littlestone_characterization
For any hypothesis class \(\mathcal{H}\),
\[
\mathrm{M}(\mathcal{H}) = \mathrm{Ldim}(\mathcal{H}),
\]
where \(\mathrm{M}(\mathcal{H})\) denotes the optimal mistake bound of \(\mathcal{H}\) in the online learning game. That is, the best mistake bound any learner can guarantee is exactly the Littlestone dimension of \(\mathcal{H}\). In particular, \(\mathcal{H}\) admits a finite mistake bound if and only if \(\mathrm{Ldim}(\mathcal{H}) < \infty \).
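For small finite classes, the characterization can be checked directly by computing the Littlestone dimension from its recursive definition: maximize over the root instance, take \(1\) plus the minimum over the two label-restricted subclasses, provided both are nonempty. The following brute-force sketch (the helper `ldim` and the tuple encoding of hypotheses are assumptions of this illustration) does exactly that:

```python
def ldim(H, n):
    """Littlestone dimension of a finite class H over the domain {0,...,n-1}.

    Each hypothesis is a length-n tuple of 0/1 labels.  Ldim(H) is the
    largest depth of a complete mistake tree for H: recursively, 1 plus
    the minimum Ldim of the two label-restricted subclasses, maximized
    over the instance shown at the root.
    """
    best = 0
    for x in range(n):
        H0 = [h for h in H if h[x] == 0]
        H1 = [h for h in H if h[x] == 1]
        if H0 and H1:  # both children of the root must be realizable
            best = max(best, 1 + min(ldim(H0, n), ldim(H1, n)))
    return best

# Thresholds on 4 points: h_t(x) = 1 iff x >= t, for t = 0, ..., 4.
thresholds = [tuple(1 if x >= t else 0 for x in range(4)) for t in range(5)]
# Singletons on 3 points: exactly one positive instance each.
singletons = [tuple(1 if x == i else 0 for x in range(3)) for i in range(3)]
```

Under this encoding, the five threshold hypotheses have Littlestone dimension \(\lfloor \log_2 5 \rfloor = 2\), while the singletons have dimension \(1\): after one split along any instance, the positive subclass is a single hypothesis.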
Littlestone proved this characterization in 1988, just four years after Valiant’s PAC model. The result established online learning as a mathematically independent paradigm: the combinatorial quantity governing online learnability is genuinely different from the one governing PAC learnability. The \(\operatorname{SOA}\) is called the “Standard Optimal Algorithm” precisely because it achieves this information-theoretic optimum.