Formal Learning Theory Kernel: Blueprint v1

6.2 Separations Between Paradigms

We now present the nine does_not_imply edges in the graph, each with its witness construction. The order moves from the most elementary witness (a one-parameter family on \(\mathbb {R}\)) to the most conceptually involved (the breakdown of the fundamental theorem beyond binary classification).

PAC learning \(\not\Rightarrow \) mistake-bounded learning.

Witness. Let \(\mathcal{H} = \{ h_\theta : \theta \in \mathbb {R}\} \) be the class of thresholds on \(\mathbb {R}\), where \(h_\theta (x) = \mathbf{1}[x \geq \theta ]\).

This class has \(\mathrm{VCdim}(\mathcal{H}) = 1\): a single point \(x\) is shattered (choose \(\theta < x\) or \(\theta > x\)), but no two points can be simultaneously shattered. By the fundamental theorem of statistical learning, \(\mathcal{H}\) is PAC learnable with sample complexity \(O(1/\varepsilon )\).
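Both shattering claims can be checked by brute force; a minimal sketch, where the grid of candidate thresholds is an illustrative choice, not part of the source:

```python
def labelings(points, thetas):
    """All distinct label vectors the threshold class induces on `points`."""
    return {tuple(int(x >= t) for x in points) for t in thetas}

# Candidate thresholds on a grid straddling the sample points.
thetas = [i / 10 for i in range(-10, 21)]

# One point is shattered: both labelings occur.
assert labelings([0.5], thetas) == {(0,), (1,)}

# Two points x1 < x2 are never shattered: the labeling (1, 0) would need
# x1 >= theta and x2 < theta, which is impossible when x1 < x2.
assert (1, 0) not in labelings([0.3, 0.7], thetas)
```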

However, \(\mathrm{Ldim}(\mathcal{H}) = \infty \). An adaptive adversary can force arbitrarily many mistakes by binary search: maintain an interval \((a, b)\) of uncertainty, present the midpoint, and whichever label the learner predicts, reveal the opposite label and place the true threshold on the corresponding side. Each round forces a mistake and halves the interval, and over the reals the interval never empties, so no finite mistake bound exists.
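The adversary argument can be written out directly; in this sketch `learner` is any callable from instances to predictions, a hypothetical interface rather than the kernel's:

```python
def force_mistakes(learner, rounds):
    """Binary-search adversary: forces a mistake from any learner each round."""
    a, b = 0.0, 1.0   # interval known to contain the true threshold
    mistakes = 0
    for _ in range(rounds):
        x = (a + b) / 2
        pred = learner(x)
        # Reveal the opposite label, shrinking the interval so that some
        # threshold theta in (a, b] stays consistent with all labels so far.
        if pred == 1:
            a = x   # revealed label 0: theta lies in (x, b]
        else:
            b = x   # revealed label 1: theta lies in (a, x]
        mistakes += 1
    return mistakes

# Every strategy errs in every round.
assert force_mistakes(lambda x: 0, 50) == 50
```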

What the witness exploits. The gap between VC and Littlestone dimension: batch learnability (distribution-free, i.i.d. samples) does not imply sequential learnability (adversarial, adaptive instances). The reals are “too dense” for any online strategy to pin down the threshold, but a random sample reveals it with high probability.


Remark 6.1

Lean: pac_not_implies_online

Formalized in the kernel; see the source link.

Ex-learning \(\not\Rightarrow \) PAC learning.

Witness. Let \(\mathcal{C} = \{ S \subseteq \mathbb {N}: S \text{ is finite} \} \), the class of all finite subsets of \(\mathbb {N}\). A Gold-style learner can identify any finite \(S\) in the limit from positive data: enumerate elements as they appear, and at each stage conjecture the set of elements seen so far. After the last new element arrives, the conjecture stabilizes on \(S\).
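The Gold-style learner is a one-line strategy; a minimal sketch, with the presentation encoded as a finite prefix of the infinite text (an illustrative simplification):

```python
def gold_learner(text):
    """Conjecture, after each datum, exactly the set of elements seen so far."""
    seen = set()
    conjectures = []
    for x in text:
        seen.add(x)
        conjectures.append(frozenset(seen))
    return conjectures

target = {2, 5, 9}
text = [5, 2, 5, 9, 2, 9, 5]   # a presentation: every element appears eventually
conjectures = gold_learner(text)

# After the last new element (9, at index 3) the conjecture never changes again.
assert all(c == frozenset(target) for c in conjectures[3:])
```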

However, \(\mathrm{VCdim}(\mathcal{C}) = \infty \): for any \(n\) points \(\{ x_1, \ldots , x_n\} \subset \mathbb {N}\), every subset is itself a finite set in \(\mathcal{C}\), so the class shatters every finite set. The fundamental theorem then implies \(\mathcal{C}\) is not PAC learnable.

What the witness exploits. The data models are incompatible. Gold-style identification receives an infinite enumeration of the target’s elements and succeeds in the limit; PAC learning receives a finite i.i.d. sample and must generalize immediately. A class can be identifiable from exhaustive enumeration yet have no finite VC dimension.


Remark 6.2

Lean: ex_not_implies_pac

Formalized in the kernel; see the source link.

PAC learning \(\not\Rightarrow \) exact learning.

Witness. PAC learning requires only \(\varepsilon \)-approximate identification under an unknown distribution; exact learning requires zero-error identification via query access. These are different success criteria on different data models, and neither subsumes the other.

More concretely, consider any concept class \(\mathcal{C}\) that is PAC learnable (finite VC dimension) but for which proper hypothesis selection is NP-hard; for instance, intersections of halfspaces in \(\mathbb {R}^d\) for sufficiently large \(d\). An improper PAC learner can achieve low error using a surrogate hypothesis class, but the exact learning model requires the learner to output a hypothesis exactly equal to the target. When the class is rich enough that finding the exact target is computationally intractable, PAC learnability does not deliver exact learnability.

What the witness exploits. The gap between approximate and exact success criteria. PAC tolerates \(\varepsilon \) error; exact learning tolerates none. This is a criterion mismatch, not a data mismatch.


Finite VC dimension \(\not\Rightarrow \) computational tractability.

Witness. Kearns and Valiant constructed concept classes based on polynomial-size Boolean circuits whose VC dimension is polynomial in the circuit size, yet for which PAC learning is computationally intractable under standard cryptographic assumptions.

The construction proceeds as follows. Fix a one-way function family. Define \(\mathcal{C}\) to be the class of functions computed by circuits of size \(s\). This class has \(\mathrm{VCdim}(\mathcal{C}) \leq O(s \log s)\), so it is information-theoretically PAC learnable with \(\mathrm{poly}(s)\) samples. But any polynomial-time PAC learner for \(\mathcal{C}\) could be used to invert the one-way function, contradicting the cryptographic assumption.

What the witness exploits. The fundamental theorem is information-theoretic: it characterizes learnability in terms of sample complexity, not computational complexity. Cryptographic hardness lives in a different layer entirely. Finite VC dimension guarantees that enough data suffices; it says nothing about whether a polynomial-time algorithm can find a good hypothesis from that data.


Finite VC dimension \(\not\Rightarrow \) low SQ dimension.

Witness. The class of parities over \(\{ 0,1\} ^n\). Each parity function \(\chi _S\), for \(S \subseteq [n]\), maps \(x \mapsto \bigoplus _{i \in S} x_i\). The parity class has \(\mathrm{VCdim}= n\) (it shatters the standard basis). But any statistical query algorithm must either use queries of tolerance \(\tau = O(2^{-n/2})\) or make \(2^{\Omega (n)}\) queries of inverse-polynomial tolerance, because distinct parities are pairwise uncorrelated under the uniform distribution.

Formally, \(\mathrm{SQdim}(\text{Parities}_n) = 2^n\), which is exponential in \(\mathrm{VCdim}= n\).
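The pairwise-uncorrelated claim can be verified exhaustively for small \(n\); a sketch in the \(\pm 1\) convention (the encoding is illustrative, not from the source):

```python
from itertools import combinations, product

def chi(S, x):
    """Parity chi_S in the +/-1 convention: (-1) to the XOR of x_i for i in S."""
    return (-1) ** sum(x[i] for i in S)

n = 4
points = list(product([0, 1], repeat=n))
subsets = [s for r in range(n + 1) for s in combinations(range(n), r)]

def corr(S, T):
    """E_x[chi_S(x) * chi_T(x)] under the uniform distribution on {0,1}^n."""
    return sum(chi(S, x) * chi(T, x) for x in points) / len(points)

# Distinct parities are exactly uncorrelated; each has unit self-correlation.
assert all(corr(S, T) == (1.0 if S == T else 0.0)
           for S in subsets for T in subsets)
```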

What the witness exploits. VC dimension measures combinatorial shattering capacity (distribution-free). SQ dimension measures pairwise correlation structure under a specific distribution. Parities are maximally uncorrelated under the uniform distribution, forcing any SQ learner to make exponentially many queries, even though the class is small in the VC sense.


Natarajan dimension \(\not\Rightarrow \) multiclass PAC learnability.

Witness. Brukhim et al. constructed a concept class \(\mathcal{C}\) with a label space \(Y\) of infinite cardinality and Natarajan dimension \(d_{\mathrm{N}}(\mathcal{C})=1\), yet \(\mathcal{C}\) is not PAC learnable.

The Natarajan dimension, which generalizes VC dimension to multiclass settings by requiring two distinct labelings of shattered points, is too coarse when \(|Y|\) is infinite. The construction builds a class in which no pair of points is Natarajan-shattered (so \(d_{\mathrm{N}}= 1\)), yet the global combinatorial structure is rich enough to prevent uniform convergence.

What the witness exploits. The Natarajan dimension captures pairwise shattering. When the label space is infinite, pairwise structure does not determine global learnability. The DS dimension, which accounts for higher-order combinatorial structure via oriented hypergraphs, is the correct characterization.


Proper learnability \(\not\Rightarrow \) efficient proper PAC learning.

Witness. The class of 3-term DNF formulas over \(\{ 0,1\} ^n\). Pitt and Valiant showed that proper PAC learning, where the hypothesis must itself be a 3-term DNF, is intractable: no polynomial-time proper learner exists unless \(\mathrm{RP} = \mathrm{NP}\). Yet the same class is efficiently PAC learnable improperly: every 3-term DNF can be rewritten as a 3-CNF by distributivity, and 3-CNFs are efficiently PAC learnable.
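The DNF-to-CNF rewriting is plain distributivity: \(T_1 \vee T_2 \vee T_3\) equals the conjunction of all clauses \((\ell_1 \vee \ell_2 \vee \ell_3)\) with \(\ell_j\) ranging over the literals of \(T_j\). A minimal sketch with a hypothetical literal encoding, checked by brute force:

```python
from itertools import product

# Literal (i, b) means "x_i == b"; a term is a list of literals (a conjunction).
def lit(l, x):
    i, b = l
    return x[i] == b

def dnf(terms, x):
    """Evaluate T1 v T2 v T3, each T_j a conjunction of literals."""
    return any(all(lit(l, x) for l in t) for t in terms)

def cnf_from_dnf(terms):
    """Distribute: T1 v T2 v T3 == AND over clauses (l1 v l2 v l3), l_j in T_j."""
    t1, t2, t3 = terms
    return [(l1, l2, l3) for l1 in t1 for l2 in t2 for l3 in t3]

def cnf(clauses, x):
    return all(any(lit(l, x) for l in c) for c in clauses)

terms = [[(0, 1), (1, 0)], [(2, 1)], [(1, 1), (3, 0)]]   # an arbitrary example
clauses = cnf_from_dnf(terms)
assert all(dnf(terms, x) == cnf(clauses, x) for x in product([0, 1], repeat=4))
```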

What the witness exploits. The gap between proper and improper learning is purely computational: the information-theoretic sample complexity is identical, but the representation constraint of proper learning introduces NP-hardness. The concept class is simple enough to learn with the right representation, but finding a hypothesis in the same representation class is intractable.


Labeled compression \(\not\Rightarrow \) unlabeled compression.

Witness. Pálvölgyi and Tardos exhibited a concept class with \(\mathrm{VCdim}= 2\) that admits a labeled sample compression scheme of size 2 but does not admit an unlabeled compression scheme of size 2.

In an unlabeled compression scheme, the compression function stores only the points (not their labels) from the training sample; the reconstruction function must infer both the hypothesis and the labels from the point set alone. The witness class is constructed so that the label information is essential for reconstruction: two subsets of the same size can correspond to different concepts, and only the labels disambiguate them.

What the witness exploits. The label information carries entropy that cannot always be recovered from the geometry of the point set alone. Unlabeled compression demands that the point configuration determines the concept, which is a strictly stronger requirement than labeled compression.
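For contrast, labeled compression uses the labels even in the simplest cases. A sketch of a size-1 labeled scheme for thresholds on \(\mathbb {R}\) (an illustrative example, not the Pálvölgyi–Tardos class) whose reconstruction branches on the kept label:

```python
def compress(sample):
    """Keep one labeled point: the leftmost positive if any, else the rightmost point."""
    positives = [x for x, y in sample if y == 1]
    if positives:
        return (min(positives), 1)
    return (max(x for x, y in sample), 0)

def reconstruct(kept):
    x0, y0 = kept
    if y0 == 1:
        return lambda x: int(x >= x0)   # threshold at the kept positive point
    return lambda x: int(x > x0)        # everything up to the kept point is negative

sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]   # consistent with theta = 0.5
h = reconstruct(compress(sample))
assert all(h(x) == y for x, y in sample)
```

Without the label bit, the reconstruction could not tell whether the kept point marks the first positive or the last negative, which is exactly the disambiguation the witness class makes indispensable.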


The fundamental theorem \(\not\Rightarrow \) stability characterization.

Witness. The Fundamental Theorem of Statistical Learning (Chapter 3) ties together VC dimension, uniform convergence, and PAC learnability into a single equivalence. Nine conditions, all the same condition. That theorem is the crown jewel of binary classification.

It does not survive contact with real-valued loss.

Shalev-Shwartz et al. construct a learning problem, with general loss and multi-valued output, that is learnable, yet for which uniform convergence fails outright. The nine-way equivalence collapses. In its place, a different quantity takes over: algorithmic stability, measuring how much a learner’s output changes when a single training point is perturbed.

What the witness exploits. The symmetrization argument at the heart of the Fundamental Theorem’s proof requires the loss to take finitely many values. Specifically: two. That assumption is invisible in the binary setting; it looks like a harmless feature of the problem. Make the loss real-valued and symmetrization breaks. Uniform convergence stops characterizing learnability. Stability, indifferent to loss cardinality, remains.

The Fundamental Theorem is not wrong. It is local, an artifact of the 0-1 loss that does not announce itself as such.
