4.3 The Standard Optimal Algorithm
The Halving algorithm is the simplest reasonable strategy for online learning: maintain the version space (the set of hypotheses consistent with all observations so far), and predict the majority vote of the version space. Each mistake eliminates at least half of the surviving hypotheses, so Halving makes at most \(\lfloor \log _2 |\mathcal{H}| \rfloor \) mistakes when \(\mathcal{H}\) is finite.
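The Halving update can be sketched in a few lines of Python for a small finite class (a brute-force illustration, not the blueprint's Lean code; hypotheses are label tuples over a domain \(\{0, \dots, n-1\}\), and the name `halving_run` is ours):

```python
def halving_run(H, examples):
    """Run Halving on a realizable labeled sequence.

    H: iterable of hypotheses, each a tuple (h[x] is h's label on instance x).
    examples: list of (x, y) pairs.  Returns the number of mistakes."""
    V = set(H)
    mistakes = 0
    for x, y in examples:
        ones = sum(h[x] for h in V)
        y_hat = 1 if 2 * ones >= len(V) else 0  # majority vote; ties -> 1
        if y_hat != y:
            mistakes += 1  # the majority was wrong, so |V| at least halves
        V = {h for h in V if h[x] == y}  # keep only consistent hypotheses
    return mistakes

# All 8 binary hypotheses over a 3-point domain; labels from target (1, 0, 1).
H = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
m = halving_run(H, [(0, 1), (1, 0), (2, 1)])
assert m <= 3  # at most floor(log2 8) = 3 mistakes
```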
But Halving is not optimal for infinite hypothesis classes. The Standard Optimal Algorithm (SOA), due to Littlestone (1988), refines the Halving idea by replacing “majority vote” with a more subtle criterion: predict the label whose version space has the larger Littlestone dimension.
Lean: SOA_predict_eq
Given a hypothesis class \(\mathcal{H}\), the \(\operatorname{SOA}\) maintains a version space \(V_t \subseteq \mathcal{H}\), initially \(V_1 = \mathcal{H}\). At round \(t\), upon receiving instance \(x_t\):
1. Partition \(V_t\) into \(V_t^0 = \{ h \in V_t : h(x_t) = 0\} \) and \(V_t^1 = \{ h \in V_t : h(x_t) = 1\} \).
2. Predict \(\hat{y}_t = \arg \max _{b \in \{ 0,1\} } \mathrm{Ldim}(V_t^b)\).
3. After observing the true label \(y_t\), update \(V_{t+1} = V_t^{y_t}\).
Ties in step 2 are broken arbitrarily.
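The steps above can be sketched in Python for a small finite class, using a brute-force recursion for the Littlestone dimension (the names `ldim`, `soa_predict`, `soa_run` and the convention \(\mathrm{Ldim}(\emptyset) = -1\) are ours):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ldim(H):
    """Brute-force Littlestone dimension of a finite class H (a frozenset
    of label tuples).  Convention: ldim of the empty class is -1."""
    if not H:
        return -1
    n = len(next(iter(H)))
    best = 0
    for x in range(n):
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = frozenset(h for h in H if h[x] == 1)
        if H0 and H1:  # x can root a mistake tree only if both labels occur
            best = max(best, 1 + min(ldim(H0), ldim(H1)))
    return best

def soa_predict(V, x):
    """Step 2: predict the label whose side has the larger Ldim (ties -> 0)."""
    V0 = frozenset(h for h in V if h[x] == 0)
    V1 = frozenset(h for h in V if h[x] == 1)
    return 0 if ldim(V0) >= ldim(V1) else 1

def soa_run(H, examples):
    """Run the SOA on a realizable (x, y) sequence; return the mistake count."""
    V = frozenset(H)
    mistakes = 0
    for x, y in examples:
        if soa_predict(V, x) != y:
            mistakes += 1
        V = frozenset(h for h in V if h[x] == y)  # step 3
    return mistakes

# Thresholds on {0, 1, 2}: h_t(x) = 1 iff x >= t.  This class has Ldim = 2.
H = frozenset((int(0 >= t), int(1 >= t), int(2 >= t)) for t in range(4))
assert ldim(H) == 2
assert soa_run(H, [(1, 1), (0, 0), (2, 1)]) <= 2  # target is h_1 = (0, 1, 1)
```

The recursion mirrors the mistake-tree definition: an instance \(x\) can serve as the root of a depth-\((d+1)\) tree exactly when both label restrictions admit depth-\(d\) trees.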
The intuition behind the \(\operatorname{SOA}\) is a tree-based version of binary search. At each round, the algorithm asks: “If I predict \(0\) and am wrong, what is the worst the adversary can do to \(V_t^1\)? And if I predict \(1\) and am wrong, what can the adversary do to \(V_t^0\)?” By choosing the side with the larger Littlestone dimension, the \(\operatorname{SOA}\) ensures that every mistake reduces the Littlestone dimension of the version space by at least one.
Lean: backward_direction
For any hypothesis class \(\mathcal{H}\) with \(\mathrm{Ldim}(\mathcal{H}) = d < \infty \), the \(\operatorname{SOA}\) makes at most \(d\) mistakes against any adversary.
We show that the Littlestone dimension of the version space drops by at least one at each mistake. Define \(d_t = \mathrm{Ldim}(V_t)\). We claim: if the \(\operatorname{SOA}\) makes a mistake at round \(t\), then \(d_{t+1} \leq d_t - 1\).
Proof of the claim. Suppose the \(\operatorname{SOA}\) makes a mistake at round \(t\). This means \(\hat{y}_t \neq y_t\). By the algorithm’s rule, \(\hat{y}_t\) was chosen so that \(\mathrm{Ldim}(V_t^{\hat{y}_t}) \geq \mathrm{Ldim}(V_t^{y_t})\), i.e., the \(\operatorname{SOA}\) predicted the side with the larger (or equal) Littlestone dimension. After the mistake, the version space updates to \(V_{t+1} = V_t^{y_t}\), the side with the smaller (or equal) Littlestone dimension.
Now we must show that \(\mathrm{Ldim}(V_t^{y_t}) \leq d_t - 1\). Suppose for contradiction that \(\mathrm{Ldim}(V_t^0) \geq d_t\) and \(\mathrm{Ldim}(V_t^1) \geq d_t\). Then there exists a complete mistake tree \(T_0\) of depth \(d_t\) for \(V_t^0\) and a complete mistake tree \(T_1\) of depth \(d_t\) for \(V_t^1\). We can construct a complete mistake tree of depth \(d_t + 1\) for \(V_t\): the root is labeled \(x_t\), its left subtree (corresponding to label \(0\)) is \(T_0\), and its right subtree (corresponding to label \(1\)) is \(T_1\). Every root-to-leaf path through the left subtree is consistent with some \(h \in V_t^0 \subseteq V_t\) (and \(h(x_t) = 0\)), and similarly for the right subtree. This yields \(\mathrm{Ldim}(V_t) \geq d_t + 1\), contradicting \(\mathrm{Ldim}(V_t) = d_t\).
Therefore \(\min \{ \mathrm{Ldim}(V_t^0), \mathrm{Ldim}(V_t^1)\} \leq d_t - 1\). Since \(\hat{y}_t\) selects the side with the larger dimension, \(V_{t+1} = V_t^{y_t}\) is the side with the smaller dimension, so \(d_{t+1} = \mathrm{Ldim}(V_{t+1}) \leq d_t - 1\).
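The inequality \(\min \{ \mathrm{Ldim}(V_t^0), \mathrm{Ldim}(V_t^1)\} \leq d_t - 1\) can be checked numerically on a small class: for every instance \(x\), at least one side of the partition has strictly smaller dimension. A brute-force sketch (the `ldim` helper and the \(\mathrm{Ldim}(\emptyset) = -1\) convention are ours):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ldim(H):
    """Brute-force Littlestone dimension of a finite class of label tuples.
    Convention: ldim of the empty class is -1."""
    if not H:
        return -1
    n = len(next(iter(H)))
    best = 0
    for x in range(n):
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = frozenset(h for h in H if h[x] == 1)
        if H0 and H1:
            best = max(best, 1 + min(ldim(H0), ldim(H1)))
    return best

# Thresholds on {0, ..., 4}: h_t(x) = 1 iff x >= t.
H = frozenset(tuple(int(x >= t) for x in range(5)) for t in range(6))
d = ldim(H)
for x in range(5):
    H0 = frozenset(h for h in H if h[x] == 0)
    H1 = frozenset(h for h in H if h[x] == 1)
    # On every instance, the smaller side has dimension at most d - 1.
    assert min(ldim(H0), ldim(H1)) <= d - 1
```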
Completing the proof. We have \(d_1 = \mathrm{Ldim}(\mathcal{H}) = d\). On rounds without a mistake, \(V_{t+1} \subseteq V_t\), so \(d_t\) never increases; by the claim, each mistake decreases \(d_t\) by at least \(1\). Since the Littlestone dimension is a non-negative integer (the version space always contains the target \(h^*\), so \(V_t \neq \emptyset \) and \(\mathrm{Ldim}(V_t) \geq 0\)), the total number of mistakes is at most \(d\).
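On a class small enough to enumerate, the bound can be verified exhaustively: for every target hypothesis and every order of the instances, the SOA makes at most \(d\) mistakes. A brute-force sketch (the helper names are ours):

```python
from functools import lru_cache
from itertools import permutations

@lru_cache(maxsize=None)
def ldim(H):
    """Brute-force Littlestone dimension; ldim of the empty class is -1."""
    if not H:
        return -1
    n = len(next(iter(H)))
    best = 0
    for x in range(n):
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = frozenset(h for h in H if h[x] == 1)
        if H0 and H1:
            best = max(best, 1 + min(ldim(H0), ldim(H1)))
    return best

def soa_mistakes(H, target, order):
    """Mistakes of the SOA when labels come from `target`, instances in `order`."""
    V, mistakes = frozenset(H), 0
    for x in order:
        V0 = frozenset(h for h in V if h[x] == 0)
        V1 = frozenset(h for h in V if h[x] == 1)
        y_hat = 0 if ldim(V0) >= ldim(V1) else 1
        if y_hat != target[x]:
            mistakes += 1
        V = V1 if target[x] == 1 else V0
    return mistakes

# Thresholds on {0, 1, 2}: h_t(x) = 1 iff x >= t; here d = Ldim = 2.
H = frozenset((int(0 >= t), int(1 >= t), int(2 >= t)) for t in range(4))
d = ldim(H)
assert all(soa_mistakes(H, h, order) <= d
           for h in H for order in permutations(range(3)))
```

This only sweeps realizable sequences in which each instance appears once, but that already covers every branch of a mistake tree for this class.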
For finite \(\mathcal{H}\), the Halving algorithm predicts by majority vote: \(\hat{y}_t = \arg \max _b |V_t^b|\). Each mistake halves the version space, giving at most \(\lfloor \log _2 |\mathcal{H}| \rfloor \) mistakes. The \(\operatorname{SOA}\) replaces cardinality \(|V_t^b|\) with \(\mathrm{Ldim}(V_t^b)\). For finite classes, this distinction is often unimportant (both give \(O(\log |\mathcal{H}|)\) mistakes), but for infinite classes, cardinality is meaningless while the Littlestone dimension is finite and well-behaved.
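Even for finite classes the gap can be concrete. For the class of singletons over \(n\) points, Halving's bound is \(\lfloor \log _2 n \rfloor \), while \(\mathrm{Ldim} = 1\), so the SOA's bound is a single mistake. A brute-force check (the `ldim` helper and the \(\mathrm{Ldim}(\emptyset) = -1\) convention are ours):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ldim(H):
    """Brute-force Littlestone dimension; ldim of the empty class is -1."""
    if not H:
        return -1
    n = len(next(iter(H)))
    best = 0
    for x in range(n):
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = frozenset(h for h in H if h[x] == 1)
        if H0 and H1:
            best = max(best, 1 + min(ldim(H0), ldim(H1)))
    return best

n = 8
# Singletons over n points: h_i labels only point i positive.
singletons = frozenset(tuple(int(x == i) for x in range(n)) for i in range(n))
assert ldim(singletons) == 1  # SOA bound: 1 mistake
# Halving's cardinality bound for the same class is floor(log2 8) = 3.
```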
The \(\operatorname{SOA}\) is an information-theoretically optimal algorithm, but it is not necessarily computationally efficient. Computing \(\mathrm{Ldim}(V_t^b)\) at each round may be undecidable for some hypothesis classes. The \(\operatorname{SOA}\) is thus an existence proof: it shows that \(d\) mistakes suffice, even if finding the optimal prediction at each step is computationally hard. Efficient online algorithms for specific classes (Perceptron for halfspaces, Winnow for disjunctions) achieve near-optimal mistake bounds through class-specific structure.