PAC Learning Theorems #
VC characterization, fundamental theorem (5-way equivalence), Sauer-Shelah, NFL, Occam's algorithm, PAC lower bound.
Key dependencies (K₁-K₄; K₂ is external, the rest from Mathlib) #
- K₁: Finset.vcDim + card_le_card_shatterer + card_shatterer_le_sum_vcDim (Sauer-Shelah)
- K₂: lean-rademacher (Rademacher complexity bounds) — external, future import
- K₃: Measure.pi (IsProbabilityMeasure instance for product measures)
- K₄: measure_sum_ge_le_of_iIndepFun (Hoeffding's inequality)
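The shattering notion that K₁ quantifies over fits in a few lines. A minimal, Mathlib-free sketch (the name `Shatters` matches this file's usage; the `List`-based signature is an assumption, and the actual development may state it over `Finset X`):

```lean
-- Illustrative sketch of the shattering predicate behind K₁.
-- `C` is a concept class of Boolean classifiers; `s` a finite sample.
-- `C` shatters `s` when every Boolean labeling of `s` is realized by
-- some concept in `C`.
def Shatters {X : Type} (C : Set (X → Bool)) (s : List X) : Prop :=
  ∀ f : X → Bool, ∃ c ∈ C, ∀ x ∈ s, c x = f x
```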
Break Point BP₅: Five Generalization Bounds #
The fundamental theorem bundles five characterizations with different type signatures. This conjunction IS the theorem.
Proof metaprogram for VCDim < ∞ → PACLearnable #
The standard proof has three layers:
- Sauer-Shelah: VCDim < ∞ → growth function is polynomial (Mathlib: card_shatterer_le_sum_vcDim, accessed via B₄ bridge)
- Uniform convergence: polynomial growth + Hoeffding → ∀D, P[|emp_err - true_err| > ε] < δ (Mathlib: measure_sum_ge_le_of_iIndepFun for concentration)
- ERM works: uniform convergence → any ERM learner PAC-learns C (Connects to BatchLearner via output_in_H)
The reverse direction PACLearnable → VCDim < ∞ uses a probabilistic construction: if VCDim = ∞, construct a distribution D where any learner fails with probability > δ for some ε.
Direction ←: finite VCDim implies PAC learnability.
PROOF ROUTE (via new infrastructure in Generalization.lean):
Step 1: VCDim < ∞ → HasUniformConvergence (vcdim_finite_imp_uc)
  Sub-step 1a: Sauer-Shelah gives GrowthFunction bound
  Sub-step 1b: Symmetrization reduces UC to growth function counting
  Sub-step 1c: Concentration inequality closes the bound
Step 2: HasUniformConvergence → PACLearnable (uc_imp_pac)
  Sub-step 2a: Construct ERM learner
  Sub-step 2b: ERM is consistent in realizable case
  Sub-step 2c: Consistent + UC → low TrueError
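Steps 1 and 2 compose into the ← direction. A content-free Lean sketch over opaque propositions (the three hypothesis names mirror this file's lemma names; everything else is schematic):

```lean
-- Step 1 composed with Step 2 yields the ← direction of the
-- VC characterization: finite VC dimension implies PAC learnability.
example (VCFinite HasUC PAC : Prop)
    (vcdim_finite_imp_uc : VCFinite → HasUC)
    (uc_imp_pac : HasUC → PAC) :
    VCFinite → PAC :=
  uc_imp_pac ∘ vcdim_finite_imp_uc
```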
KU₁₈: C.Nonempty is needed for ERM but not stated as hypothesis. If C = ∅, then PACLearnable is vacuously true (∀ c ∈ C, ... is vacuous). But ERM needs a fallback hypothesis from C. Is this a genuine gap or does the empty case work out vacuously?
Counterdefinition (COUNTER-4): If the ERM approach fails for computational reasons (ERM is noncomputable, and we need a computable learner for computational learning theory), swap to the compression-based proof: VCDim < ∞ → finite compression scheme (Moran-Yehudayoff 2016) → compression scheme learner is PAC. Swap condition: When proving COMPUTATIONAL PAC learnability (polynomial time).
Direction →: PAC learnability implies finite VCDim.
PROOF ROUTE (via double-sample infrastructure in Generalization.lean):
Step 1: Contrapositive — assume VCDim = ∞
Step 2: For m = mf(ε,δ), extract S with |S| = 2m shattered by C
(uses WithTop.eq_top_iff_forall_ge, same as vcdim_univ_infinite)
Step 3: Construct D = uniform on S (Finset.uniformMeasure?)
KU₁₉: Mathlib's uniform measure on a finite set — does
MeasureTheory.Measure.count / Finset.card give IsProbabilityMeasure?
Step 4: Double-sample trick via GhostSample + symmetrization
Step 5: Counting argument on restricted labelings
HC at this joint: Step 3 requires constructing a specific probability measure from a combinatorial object (the shattered set). This is a P₁→P₂ crossing. UK₉: The construction of the hard distribution is the only non-constructive step. Can it be made constructive? (Related to derandomization in learning.)
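Step 3's measure construction can be sketched directly as a normalized sum of Dirac masses, which is one answer to the Finset.uniformMeasure question. Here `uniformOn` is a hypothetical name, and the `IsProbabilityMeasure` instance is left as `sorry` pending KU₁₉:

```lean
import Mathlib
open MeasureTheory

-- Hypothetical construction for Step 3: uniform probability measure on
-- a finset S, as a normalized sum of Dirac measures.
noncomputable def uniformOn {X : Type*} [MeasurableSpace X] (S : Finset X) :
    Measure X :=
  (S.card : ENNReal)⁻¹ • ∑ x ∈ S, Measure.dirac x

-- Each Dirac has total mass 1, so the sum has mass S.card; normalizing
-- gives mass 1 whenever S is nonempty (card ≠ 0 is needed for ⁻¹).
example {X : Type*} [MeasurableSpace X] (S : Finset X) (hS : S.Nonempty) :
    IsProbabilityMeasure (uniformOn S) := by
  sorry
```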
VC characterization: C is PAC-learnable iff VCDim(C) < ∞.
PROOF DECOMPOSITION: This theorem factors through the two directions above: ← : vcdim_finite_imp_uc + uc_imp_pac (in Generalization.lean) → : pac_imp_vcdim_finite (contrapositive via double-sample)
HC at this joint: The ← direction crosses from combinatorics (VCDim, GrowthFunction) to measure theory (Measure.pi, TrueError). The → direction crosses from measure theory back to combinatorics. Both crossings have HC > 0.
UK₈: The ↔ hides an ASYMMETRY: the ← proof is constructive (produces ERM), while the → proof is non-constructive (produces hard distribution).
Sauer-Shelah lemma: if VCDim(C) = d, then for every m the growth function is bounded by the polynomial Σᵢ₌₀ᵈ C(m,i).
This is the quantitative version, and the bound is tight. The further simplification Σᵢ₌₀ᵈ C(m,i) ≤ (em/d)^d holds for m ≥ d.
Proof via Mathlib bridge:
- Bridge our Shatters to Finset.Shatters (B₃)
- Apply card_shatterer_le_sum_vcDim from Mathlib
- Bridge back to our GrowthFunction
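A quick numeric sanity check of the polynomial bound (assumes Mathlib's `Nat.choose`; the (em/d)^d comparison appears only in a comment):

```lean
import Mathlib

-- Sauer-Shelah bound at d = 3, m = 10: Σ_{i=0}^{3} C(10, i).
#eval (List.range 4).map (Nat.choose 10)        -- [1, 10, 45, 120]
#eval ((List.range 4).map (Nat.choose 10)).sum  -- 176
-- For comparison, (e·m/d)^d = (e·10/3)^3 ≈ 744, consistent with
-- Σᵢ₌₀ᵈ C(m,i) ≤ (em/d)^d for m ≥ d.
```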
Weak Sauer-Shelah (legacy statement, trivially true).
PAC lower bound: sample complexity is at least linear in d/ε.
A₄ REPAIR: The original statement ∃ lower, lower ≤ SampleComplexity was
trivially true via ⟨0, Nat.zero_le _⟩. The corrected statement asserts the
SPECIFIC quantitative lower bound from learning theory:
m ≥ ⌈(d-1)/(64ε)⌉ for PAC learning with VCDim = d.
Note: the tight constant is (d-1)/(2ε) (EHKV 1989); see EHKV.lean.
Proof route: construct 2^d labelings on a shattered set of size d, use double-averaging + reversed Markov to show that m < (d-1)/(64ε) implies Pr[error ≤ ε] < 6/7 under uniform distribution on shattered set.
KU₂₀: The exact constant (1/7 vs 1/8 vs 1/4) depends on the proof technique. The factor (d-1) vs d also varies by source.
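Instantiating the stated bound (hypothetical numbers, chosen so the arithmetic is exact): with d = 65 and ε = 1/100, the threshold (d-1)/(64ε) is exactly 100, so the ceiling adds nothing and any PAC learner needs at least 100 samples:

```lean
import Mathlib

-- (d - 1) / (64ε) at d = 65, ε = 1/100; the value is an integer,
-- so ⌈·⌉ is the identity here.
example : ((65 : ℚ) - 1) / (64 * (1/100)) = 100 := by norm_num
```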
Any PAC witness (L, mf) gives an upper bound on SampleComplexity: the infimum is at most the witness sample size.
Quantitative sample-complexity sandwich attached to any PAC witness. Packages: (1) PAC guarantee, (2) SampleComplexity ≤ mf, (3) NFL/VC lower bound on both SampleComplexity and mf.
Fundamental theorem: finite VC dim ↔ finite compression scheme with side information. Moran-Yehudayoff 2016 (arXiv:1503.06960). Sorry-free via Compression.lean. Γ₇₃ RESOLVED: CompressionSchemeWithInfo parameterized by concept class C. The no-side-info version (Littlestone-Warmuth conjecture) remains open.
Fundamental theorem: Rademacher complexity characterization. BP₅: two asymmetric directions crossing different paradigm joints. Uses uniform vanishing (∃ m₀ ∀ D), which is the textbook-standard form.
Fundamental theorem of statistical learning (5-way equivalence, BP₅).
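A schematic of the BP₅ bundle over opaque propositions (hedged: which of the five acts as the hub is a design choice; here finite VC dimension, matching the characterizations discussed in this file):

```lean
-- P = PACLearnable, V = VCDim < ∞, U = uniform convergence,
-- K = finite compression scheme, R = Rademacher complexity vanishes.
-- Hub-and-spoke: each spoke equivalence composes through V.
example (P V U K R : Prop)
    (h₁ : P ↔ V) (h₂ : V ↔ U) (h₃ : V ↔ K) (h₄ : V ↔ R) :
    (P ↔ U) ∧ (P ↔ K) ∧ (P ↔ R) :=
  ⟨h₁.trans h₂, h₁.trans h₃, h₁.trans h₄⟩
```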
A₄ CORRECTION: The original NFL statement
¬ PACLearnable X Set.univ for [Fintype X] is PROVABLY FALSE.
For finite X: VCDim(Set.univ) = Fintype.card X < ⊤, so by vc_characterization
(← direction), Set.univ IS PAC-learnable (with sample complexity O(|X|/ε)).
The learner can take m ≥ |X| samples and memorize the entire domain.
The correct NFL theorem operates on INFINITE domains where VCDim = ∞.
NFL corollary: Set.univ over infinite domains is not PAC-learnable.
NFL for fixed sample size (Shalev-Shwartz & Ben-David Theorem 5.1): For any fixed sample size m, there exists a distribution such that any learner using m samples fails on SOME concept in Set.univ.
A₄ REPAIR: Original statement used ∃ D, (IsProbabilityMeasure D → ...)
which allows D = 0 (zero measure), making the implication vacuously true.
Corrected to ∃ D, IsProbabilityMeasure D ∧ ... (bundled conjunction).
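The repair matters because the original pattern is satisfiable by a degenerate witness. In the abstract, with D = 0 playing the role of `a` and `IsProbabilityMeasure` the role of `P`:

```lean
-- An existential over an implication is discharged by any witness
-- falsifying the antecedent, regardless of the conclusion Q.
example {α : Type} (P Q : α → Prop) (a : α) (ha : ¬ P a) :
    ∃ x, P x → Q x :=
  ⟨a, fun h => absurd h ha⟩
```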
Proof route:
- Let T ⊂ X with |T| = 2m (exists since 2m ≤ |X|)
- D = uniform over T: D = (1/2m) · Σ_{x ∈ T} δ_x
- For any learner L, average over random labelings c : T → Bool: E_c[TrueError(L(S), c, D)] ≥ 1/4 (the unseen points are random)
- By Markov: ∃ c with TrueError > 1/8 with positive probability
KU₂₁: Constructing uniform measure on T requires Fintype instance or manual construction via Finset.sum of Dirac measures. UK₁₀: The averaging-over-labelings step is where the counting argument lives — this is combinatorial, not measure-theoretic.
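The 1/4 and 1/8 in the proof route connect to the 1/7 of KU₂₀ via reverse Markov: for a random variable Z ∈ [0,1] with E[Z] ≥ a and threshold t < a, P[Z > t] ≥ (a - t)/(1 - t). A check of the constant arithmetic only (the probabilistic statement itself is not formalized here):

```lean
import Mathlib

-- Reverse Markov constants: a = 1/4 (double-averaging lower bound),
-- t = 1/8 (error threshold) gives failure probability ≥ 1/7.
example : ((1 : ℚ)/4 - 1/8) / (1 - 1/8) = 1/7 := by norm_num
```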
Occam's algorithm: any consistent learner with bounded description length is a PAC learner.
Hypotheses (what the theorem NEEDS but the current statement is MISSING):
- L is consistent: ∀ S, ∀ i, L.learn S (S i).1 = (S i).2
- Description length bound: ∀ c ∈ C, dl c ≤ k (for some k)
- Union bound (not Hoeffding; the realizable case only needs (1-ε)^m ≤ e^(-εm) per hypothesis, then a union over the ≤ 2^k hypotheses with dl ≤ k): for an iid sample of size m ≥ (1/ε)(k·ln 2 + ln(1/δ)), every consistent hypothesis has true error ≤ ε with probability ≥ 1-δ
The current hypothesis only states consistency. The dl parameter is unused. ABD-R: add hypothesis (∀ c ∈ C, dl c ≤ k) and set m accordingly.
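The ABD-R shape can be sketched abstractly: the repaired statement takes the description-length bound as an explicit hypothesis alongside consistency, and the conclusion follows from whatever core lemma discharges the union-bound argument (all names here are opaque stand-ins, not this file's actual signatures):

```lean
-- Shape of the repaired Occam statement: two hypotheses in, PAC out.
-- `occam_core` abstracts the union-bound argument sketched above.
example {H : Type} (C : Set H) (dl : H → ℕ) (k : ℕ)
    (Consistent PAC : Prop)
    (occam_core : (∀ c ∈ C, dl c ≤ k) → Consistent → PAC)
    (hdl : ∀ c ∈ C, dl c ≤ k) (hcons : Consistent) : PAC :=
  occam_core hdl hcons
```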