PAC Learning Theorems #
VC characterization, fundamental theorem (5-way equivalence), Sauer-Shelah, NFL, Occam's algorithm, PAC lower bound.
Key dependencies (K₁-K₄; K₂ is external, the rest from Mathlib) #
- K₁: Finset.vcDim + card_le_card_shatterer + card_shatterer_le_sum_vcDim (Sauer-Shelah)
- K₂: lean-rademacher (Rademacher complexity bounds) — external, future import
- K₃: Measure.pi (IsProbabilityMeasure instance for product measures)
- K₄: measure_sum_ge_le_of_iIndepFun (Hoeffding's inequality)
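The shattering notion that K₁ quantifies over fits in a few lines. A minimal, Mathlib-free sketch (the name `Shatters` matches this file's usage; the `List`-based signature is an assumption, and the actual development may state it over `Finset X`):

```lean
-- Illustrative sketch of the shattering predicate behind K₁.
-- `C` is a concept class of Boolean classifiers; `s` a finite sample.
-- `C` shatters `s` when every Boolean labeling of `s` is realized by
-- some concept in `C`.
def Shatters {X : Type} (C : Set (X → Bool)) (s : List X) : Prop :=
  ∀ f : X → Bool, ∃ c ∈ C, ∀ x ∈ s, c x = f x
```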
Break Point BP₅: Five Generalization Bounds #
The fundamental theorem bundles five characterizations with different type signatures. This conjunction IS the theorem.
Proof metaprogram for VCDim < ∞ → PACLearnable #
The standard proof has three layers:
- Sauer-Shelah: VCDim < ∞ → growth function is polynomial (Mathlib: card_shatterer_le_sum_vcDim, accessed via B₄ bridge)
- Uniform convergence: polynomial growth + Hoeffding → ∀D, P[|emp_err - true_err| > ε] < δ (Mathlib: measure_sum_ge_le_of_iIndepFun for concentration)
- ERM works: uniform convergence → any ERM learner PAC-learns C (Connects to BatchLearner via output_in_H)
The reverse direction PACLearnable → VCDim < ∞ uses a probabilistic construction: if VCDim = ∞, construct a distribution D where any learner fails with probability > δ for some ε.
Direction ←: finite VCDim implies PAC learnability.
PROOF ROUTE (via new infrastructure in Generalization.lean):
Step 1: VCDim < ∞ → HasUniformConvergence (vcdim_finite_imp_uc)
  Sub-step 1a: Sauer-Shelah gives GrowthFunction bound
  Sub-step 1b: Symmetrization reduces UC to growth function counting
  Sub-step 1c: Concentration inequality closes the bound
Step 2: HasUniformConvergence → PACLearnable (uc_imp_pac)
  Sub-step 2a: Construct ERM learner
  Sub-step 2b: ERM is consistent in realizable case
  Sub-step 2c: Consistent + UC → low TrueError
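Steps 1 and 2 compose into the ← direction. A content-free Lean sketch over opaque propositions (the three hypothesis names mirror this file's lemma names; everything else is schematic):

```lean
-- Step 1 composed with Step 2 yields the ← direction of the
-- VC characterization: finite VC dimension implies PAC learnability.
example (VCFinite HasUC PAC : Prop)
    (vcdim_finite_imp_uc : VCFinite → HasUC)
    (uc_imp_pac : HasUC → PAC) :
    VCFinite → PAC :=
  uc_imp_pac ∘ vcdim_finite_imp_uc
```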
KU₁₈: C.Nonempty is needed for ERM but not stated as hypothesis. If C = ∅, then PACLearnable is vacuously true (∀ c ∈ C, ... is vacuous). But ERM needs a fallback hypothesis from C. Is this a genuine gap or does the empty case work out vacuously?
Counterdefinition (COUNTER-4): If the ERM approach fails for computational reasons (ERM is noncomputable, and we need a computable learner for computational learning theory), swap to the compression-based proof: VCDim < ∞ → finite compression scheme (Moran-Yehudayoff 2016) → compression scheme learner is PAC. Swap condition: When proving COMPUTATIONAL PAC learnability (polynomial time).
Direction →: PAC learnability implies finite VCDim.
PROOF ROUTE (via double-sample infrastructure in Generalization.lean):
Step 1: Contrapositive — assume VCDim = ∞
Step 2: For m = mf(ε,δ), extract S with |S| = 2m shattered by C
(uses WithTop.eq_top_iff_forall_ge, same as vcdim_univ_infinite)
Step 3: Construct D = uniform on S (Finset.uniformMeasure?)
KU₁₉: Mathlib's uniform measure on a finite set — does
MeasureTheory.Measure.count / Finset.card give IsProbabilityMeasure?
Step 4: Double-sample trick via GhostSample + symmetrization
Step 5: Counting argument on restricted labelings
HC at this joint: Step 3 requires constructing a specific probability measure from a combinatorial object (the shattered set). This is a P₁→P₂ crossing. UK₉: The construction of the hard distribution is the only non-constructive step. Can it be made constructive? (Related to derandomization in learning.)
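Step 3's measure construction can be sketched directly as a normalized sum of Dirac masses, which is one answer to the Finset.uniformMeasure question. Here `uniformOn` is a hypothetical name, and the `IsProbabilityMeasure` instance is left as `sorry` pending KU₁₉:

```lean
import Mathlib
open MeasureTheory

-- Hypothetical construction for Step 3: uniform probability measure on
-- a finset S, as a normalized sum of Dirac measures.
noncomputable def uniformOn {X : Type*} [MeasurableSpace X] (S : Finset X) :
    Measure X :=
  (S.card : ENNReal)⁻¹ • ∑ x ∈ S, Measure.dirac x

-- Each Dirac has total mass 1, so the sum has mass S.card; normalizing
-- gives mass 1 whenever S is nonempty (card ≠ 0 is needed for ⁻¹).
example {X : Type*} [MeasurableSpace X] (S : Finset X) (hS : S.Nonempty) :
    IsProbabilityMeasure (uniformOn S) := by
  sorry
```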
VC characterization: C is PAC-learnable iff VCDim(C) < ∞.
PROOF DECOMPOSITION: This theorem factors through the two directions above: ← : vcdim_finite_imp_uc + uc_imp_pac (in Generalization.lean) → : pac_imp_vcdim_finite (contrapositive via double-sample)
HC at this joint: The ← direction crosses from combinatorics (VCDim, GrowthFunction) to measure theory (Measure.pi, TrueError). The → direction crosses from measure theory back to combinatorics. Both crossings have HC > 0.
UK₈: The ↔ hides an ASYMMETRY: the ← proof is constructive (produces ERM), while the → proof is non-constructive (produces hard distribution).
Sauer-Shelah lemma: if VCDim(C) = d, then for every m the growth function is bounded by the polynomial Σᵢ₌₀ᵈ C(m,i).
This is the quantitative version, and the bound is tight. The further simplification Σᵢ₌₀ᵈ C(m,i) ≤ (em/d)^d holds for m ≥ d.
Proof via Mathlib bridge:
- Bridge our Shatters to Finset.Shatters (B₃)
- Apply card_shatterer_le_sum_vcDim from Mathlib
- Bridge back to our GrowthFunction
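A quick numeric sanity check of the polynomial bound (assumes Mathlib's `Nat.choose`; the (em/d)^d comparison appears only in a comment):

```lean
import Mathlib

-- Sauer-Shelah bound at d = 3, m = 10: Σ_{i=0}^{3} C(10, i).
#eval (List.range 4).map (Nat.choose 10)        -- [1, 10, 45, 120]
#eval ((List.range 4).map (Nat.choose 10)).sum  -- 176
-- For comparison, (e·m/d)^d = (e·10/3)^3 ≈ 744, consistent with
-- Σᵢ₌₀ᵈ C(m,i) ≤ (em/d)^d for m ≥ d.
```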
Weak Sauer-Shelah (legacy statement, trivially true).
PAC lower bound: sample complexity is at least linear in d/ε.
A₄ REPAIR: The original statement ∃ lower, lower ≤ SampleComplexity was
trivially true via ⟨0, Nat.zero_le _⟩. The corrected statement asserts the
SPECIFIC quantitative lower bound from learning theory:
m ≥ ⌈(d-1)/(64ε)⌉ for PAC learning with VCDim = d.
Note: the tight constant is (d-1)/(2ε) (EHKV 1989); see EHKV.lean.
Proof route: construct 2^d labelings on a shattered set of size d, use double-averaging + reversed Markov to show that m < (d-1)/(64ε) implies Pr[error ≤ ε] < 6/7 under uniform distribution on shattered set.
KU₂₀: The exact constant (1/7 vs 1/8 vs 1/4) depends on the proof technique. The factor (d-1) vs d also varies by source.
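Instantiating the stated bound (hypothetical numbers, chosen so the arithmetic is exact): with d = 65 and ε = 1/100, the threshold (d-1)/(64ε) is exactly 100, so the ceiling adds nothing and any PAC learner needs at least 100 samples:

```lean
import Mathlib

-- (d - 1) / (64ε) at d = 65, ε = 1/100; the value is an integer,
-- so ⌈·⌉ is the identity here.
example : ((65 : ℚ) - 1) / (64 * (1/100)) = 100 := by norm_num
```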
Any PAC witness (L, mf) gives an upper bound on SampleComplexity: the infimum is at most the witness sample size.
Quantitative sample-complexity sandwich attached to any PAC witness. Packages: (1) PAC guarantee, (2) SampleComplexity ≤ mf, (3) NFL/VC lower bound on both SampleComplexity and mf.
Fundamental theorem: finite VC dim ↔ finite compression scheme with side information. Moran-Yehudayoff 2016 (arXiv:1503.06960). Sorry-free via Compression.lean. Γ₇₃ RESOLVED: CompressionSchemeWithInfo parameterized by concept class C. The no-side-info version (Littlestone-Warmuth conjecture) remains open.
Fundamental theorem: Rademacher complexity characterization. BP₅: two asymmetric directions crossing different paradigm joints. Uses uniform vanishing (∃ m₀ ∀ D), which is the textbook-standard form.
Fundamental theorem of statistical learning (5-way equivalence, BP₅).
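A schematic of the BP₅ bundle over opaque propositions (hedged: which of the five acts as the hub is a design choice; here finite VC dimension, matching the characterizations discussed in this file):

```lean
-- P = PACLearnable, V = VCDim < ∞, U = uniform convergence,
-- K = finite compression scheme, R = Rademacher complexity vanishes.
-- Hub-and-spoke: each spoke equivalence composes through V.
example (P V U K R : Prop)
    (h₁ : P ↔ V) (h₂ : V ↔ U) (h₃ : V ↔ K) (h₄ : V ↔ R) :
    (P ↔ U) ∧ (P ↔ K) ∧ (P ↔ R) :=
  ⟨h₁.trans h₂, h₁.trans h₃, h₁.trans h₄⟩
```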
A₄ CORRECTION: The original NFL statement
¬ PACLearnable X Set.univ for [Fintype X] is PROVABLY FALSE.
For finite X: VCDim(Set.univ) = Fintype.card X < ⊤, so by vc_characterization
(← direction), Set.univ IS PAC-learnable (with sample complexity O(|X|/ε)).
The learner can take m ≥ |X| samples and memorize the entire domain.
The correct NFL theorem operates on INFINITE domains where VCDim = ∞.
NFL corollary: Set.univ over infinite domains is not PAC-learnable.
NFL for fixed sample size (Shalev-Shwartz & Ben-David Theorem 5.1): For any fixed sample size m, there exists a distribution such that any learner using m samples fails on SOME concept in Set.univ.
A₄ REPAIR: Original statement used ∃ D, (IsProbabilityMeasure D → ...)
which allows D = 0 (zero measure), making the implication vacuously true.
Corrected to ∃ D, IsProbabilityMeasure D ∧ ... (bundled conjunction).
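The repair matters because the original pattern is satisfiable by a degenerate witness. In the abstract, with D = 0 playing the role of `a` and `IsProbabilityMeasure` the role of `P`:

```lean
-- An existential over an implication is discharged by any witness
-- falsifying the antecedent, regardless of the conclusion Q.
example {α : Type} (P Q : α → Prop) (a : α) (ha : ¬ P a) :
    ∃ x, P x → Q x :=
  ⟨a, fun h => absurd h ha⟩
```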
Proof route:
- Let T ⊂ X with |T| = 2m (exists since 2m ≤ |X|)
- D = uniform over T: D = (1/2m) · Σ_{x ∈ T} δ_x
- For any learner L, average over random labelings c : T → Bool: E_c[TrueError(L(S), c, D)] ≥ 1/4 (the unseen points are random)
- By Markov: ∃ c with TrueError > 1/8 with positive probability
KU₂₁: Constructing uniform measure on T requires Fintype instance or manual construction via Finset.sum of Dirac measures. UK₁₀: The averaging-over-labelings step is where the counting argument lives — this is combinatorial, not measure-theoretic.
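The 1/4 and 1/8 in the proof route connect to the 1/7 of KU₂₀ via reverse Markov: for a random variable Z ∈ [0,1] with E[Z] ≥ a and threshold t < a, P[Z > t] ≥ (a - t)/(1 - t). A check of the constant arithmetic only (the probabilistic statement itself is not formalized here):

```lean
import Mathlib

-- Reverse Markov constants: a = 1/4 (double-averaging lower bound),
-- t = 1/8 (error threshold) gives failure probability ≥ 1/7.
example : ((1 : ℚ)/4 - 1/8) / (1 - 1/8) = 1/7 := by norm_num
```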
Occam's algorithm: any consistent learner with bounded description length is a PAC learner.
Hypotheses (what the theorem NEEDS but the current statement is MISSING):
- L is consistent: ∀ S, ∀ i, L.learn S (S i).1 = (S i).2
- Description length bound: ∀ c ∈ C, dl c ≤ k (for some k)
- Union bound (not Hoeffding; the realizable case only needs (1-ε)^m ≤ e^(-εm) per hypothesis, then a union over the ≤ 2^k hypotheses with dl ≤ k): for an iid sample of size m ≥ (1/ε)(k·ln 2 + ln(1/δ)), every consistent hypothesis has true error ≤ ε with probability ≥ 1-δ
The current hypothesis only states consistency. The dl parameter is unused. ABD-R: add hypothesis (∀ c ∈ C, dl c ≤ k) and set m accordingly.
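The ABD-R shape can be sketched abstractly: the repaired statement takes the description-length bound as an explicit hypothesis alongside consistency, and the conclusion follows from whatever core lemma discharges the union-bound argument (all names here are opaque stand-ins, not this file's actual signatures):

```lean
-- Shape of the repaired Occam statement: two hypotheses in, PAC out.
-- `occam_core` abstracts the union-bound argument sketched above.
example {H : Type} (C : Set H) (dl : H → ℕ) (k : ℕ)
    (Consistent PAC : Prop)
    (occam_core : (∀ c ∈ C, dl c ≤ k) → Consistent → PAC)
    (hdl : ∀ c ∈ C, dl c ≤ k) (hcons : Consistent) : PAC :=
  occam_core hdl hcons
```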