
FLT_Proofs.Theorem.PAC

PAC Learning Theorems #

VC characterization, fundamental theorem (5-way equivalence), Sauer-Shelah, NFL, Occam's algorithm, PAC lower bound.

Key dependencies (K₁-K₃ from Mathlib) #

Break Point BP₅: Five Generalization Bounds #

The fundamental theorem bundles five characterizations with different type signatures. This conjunction IS the theorem.

Proof metaprogram for VCDim < ∞ → PACLearnable #

The standard proof has three layers:

  1. Sauer-Shelah: VCDim < ∞ → growth function is polynomial (Mathlib: card_shatterer_le_sum_vcDim, accessed via B₄ bridge)
  2. Uniform convergence: polynomial growth + Hoeffding → ∀D, P[|emp_err - true_err| > ε] < δ (Mathlib: measure_sum_ge_le_of_iIndepFun for concentration)
  3. ERM works: uniform convergence → any ERM learner PAC-learns C (Connects to BatchLearner via output_in_H)
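Chaining the three layers gives the classical VC uniform-convergence inequality. This is the textbook form (the constants differ across sources, so treat them as indicative):

$$\Pr_{S\sim D^m}\Big[\sup_{h\in C}\big|\mathrm{err}_S(h)-\mathrm{err}_D(h)\big|>\varepsilon\Big]\;\le\;4\,\Pi_C(2m)\,e^{-m\varepsilon^2/8},$$

where Π_C is the growth function. Sauer-Shelah makes Π_C(2m) polynomial in m, so the right-hand side tends to 0 and any ERM output generalizes.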

The reverse direction PACLearnable → VCDim < ∞ uses a probabilistic construction: if VCDim = ∞, construct a distribution D where any learner fails with probability > δ for some ε.

Direction ←: finite VCDim implies PAC learnability.

PROOF ROUTE (via new infrastructure in Generalization.lean):

  1. VCDim < ∞ → HasUniformConvergence (vcdim_finite_imp_uc)
     - 1a: Sauer-Shelah gives GrowthFunction bound
     - 1b: Symmetrization reduces UC to growth function counting
     - 1c: Concentration inequality closes the bound
  2. HasUniformConvergence → PACLearnable (uc_imp_pac)
     - 2a: Construct ERM learner
     - 2b: ERM is consistent in the realizable case
     - 2c: Consistent + UC → low TrueError
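The two steps compose directly. A minimal Lean sketch, assuming the signatures of `vcdim_finite_imp_uc` and `uc_imp_pac` take and return exactly these propositions (not quoted from Generalization.lean):

```lean
-- Sketch only: `X`, `C` as in this file; lemma signatures assumed, not quoted.
theorem vcdim_finite_imp_pac (h : VCDim X C < ⊤) : PACLearnable X C := by
  have huc : HasUniformConvergence X C := vcdim_finite_imp_uc h  -- Step 1
  exact uc_imp_pac huc                                           -- Step 2
```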

KU₁₈: C.Nonempty is needed for ERM but not stated as hypothesis. If C = ∅, then PACLearnable is vacuously true (∀ c ∈ C, ... is vacuous). But ERM needs a fallback hypothesis from C. Is this a genuine gap or does the empty case work out vacuously?

Counterdefinition (COUNTER-4): If the ERM approach fails for computational reasons (ERM is noncomputable, and we need a computable learner for computational learning theory), swap to the compression-based proof: VCDim < ∞ → finite compression scheme (Moran-Yehudayoff 2016) → compression scheme learner is PAC. Swap condition: When proving COMPUTATIONAL PAC learnability (polynomial time).

Direction →: PAC learnability implies finite VCDim.

PROOF ROUTE (via double-sample infrastructure in Generalization.lean):

  1. Contrapositive: assume VCDim = ∞
  2. For m = mf(ε,δ), extract S with |S| = 2m shattered by C (uses WithTop.eq_top_iff_forall_ge, same as vcdim_univ_infinite)
  3. Construct D = uniform on S (Finset.uniformMeasure?). KU₁₉: Mathlib's uniform measure on a finite set: does MeasureTheory.Measure.count / Finset.card give IsProbabilityMeasure?
  4. Double-sample trick via GhostSample + symmetrization
  5. Counting argument on restricted labelings

HC at this joint: Step 3 requires constructing a specific probability measure from a combinatorial object (the shattered set). This is a P₁→P₂ crossing. UK₉: The construction of the hard distribution is the only non-constructive step. Can it be made constructive? (Related to derandomization in learning.)
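One candidate construction for Step 3 / KU₁₉: scale a finite sum of Dirac measures. A sketch, assuming Mathlib's `Measure.dirac`, the `AddCommMonoid` structure on measures, and the `ENNReal` scalar action; `uniformOn` is a hypothetical name, not from this file:

```lean
open MeasureTheory in
/-- Uniform probability measure on a finite set `S` (hypothetical helper).
    For `S.Nonempty` this should satisfy `IsProbabilityMeasure`. -/
noncomputable def uniformOn {X : Type*} [MeasurableSpace X] (S : Finset X) :
    Measure X :=
  (S.card : ENNReal)⁻¹ • ∑ x ∈ S, Measure.dirac x
```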

VC characterization: C is PAC-learnable iff VCDim(C) < ∞.

PROOF DECOMPOSITION: This theorem factors through the two directions above:

  - ← : vcdim_finite_imp_uc + uc_imp_pac (in Generalization.lean)
  - → : pac_imp_vcdim_finite (contrapositive via double-sample)
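Assuming the two direction lemmas have the statements their names suggest, the characterization packages as a one-line sketch (signatures hypothetical):

```lean
-- Sketch: direction lemmas named above, signatures assumed.
theorem vc_characterization : PACLearnable X C ↔ VCDim X C < ⊤ :=
  ⟨pac_imp_vcdim_finite, fun h => uc_imp_pac (vcdim_finite_imp_uc h)⟩
```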

HC at this joint: The ← direction crosses from combinatorics (VCDim, GrowthFunction) to measure theory (Measure.pi, TrueError). The → direction crosses from measure theory back to combinatorics. Both crossings have HC > 0.

UK₈: The ↔ hides an ASYMMETRY: the ← proof is constructive (produces ERM), while the → proof is non-constructive (produces hard distribution).

theorem sauer_shelah_quantitative (X : Type u) [Fintype X] [DecidableEq X] (C : Finset (X → Bool)) (d : ℕ) (hd : (conceptClassToFinsetFamily C).vcDim = d) (m : ℕ) (hm : d ≤ m) :
GrowthFunction X (↑C) m ≤ ∑ i ∈ Finset.range (d + 1), m.choose i

Sauer-Shelah lemma: if VCDim(C) = d and m ≥ d, then the growth function is bounded by the polynomial Σᵢ₌₀ᵈ C(m,i).

This is the quantitative version. The bound is tight. For m ≥ d, Σᵢ₌₀ᵈ C(m,i) ≤ (em/d)^d.
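The polynomial estimate follows from a standard computation (assuming m ≥ d ≥ 1): since d/m ≤ 1, each term with i ≤ d satisfies (m/d)^d (d/m)^i ≥ 1, so

$$\sum_{i=0}^{d}\binom{m}{i}\;\le\;\Big(\frac{m}{d}\Big)^{d}\sum_{i=0}^{d}\binom{m}{i}\Big(\frac{d}{m}\Big)^{i}\;\le\;\Big(\frac{m}{d}\Big)^{d}\Big(1+\frac{d}{m}\Big)^{m}\;\le\;\Big(\frac{m}{d}\Big)^{d}e^{d}\;=\;\Big(\frac{em}{d}\Big)^{d},$$

using the binomial theorem for the middle step and (1 + x)^{1/x} ≤ e for the last.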

Proof via Mathlib bridge:

  1. Bridge our Shatters to Finset.Shatters (B₃)
  2. Apply card_shatterer_le_sum_vcDim from Mathlib
  3. Bridge back to our GrowthFunction

theorem sauer_shelah (X : Type u) (C : ConceptClass X Bool) (d m : ℕ) (_hd : VCDim X C = d) (_hm : d ≤ m) :
∃ (bound : ℕ), GrowthFunction X C m ≤ bound

Weak Sauer-Shelah (legacy statement, trivially true).

theorem pac_lower_bound (X : Type u) [MeasurableSpace X] [MeasurableSingletonClass X] (C : ConceptClass X Bool) (d : ℕ) (hd : VCDim X C = d) (ε δ : ℝ) (hε : 0 < ε) (hε1 : ε ≤ 1 / 4) (hδ : 0 < δ) (hδ1 : δ ≤ 1) (hδ2 : δ ≤ 1 / 7) (hd_pos : 1 ≤ d) [MeasurableConceptClass X C] :
⌈(d - 1) / 2⌉₊ ≤ SampleComplexity X C ε δ

PAC lower bound: sample complexity is at least linear in d/ε.

A₄ REPAIR: The original statement ∃ lower, lower ≤ SampleComplexity was trivially true via ⟨0, Nat.zero_le _⟩. The corrected statement asserts the SPECIFIC quantitative lower bound from learning theory: m ≥ ⌈(d-1)/(64ε)⌉ for PAC learning with VCDim = d. Note: the tight constant is (d-1)/(2ε) (EHKV 1989); see EHKV.lean.

Proof route: construct 2^d labelings on a shattered set of size d, use double-averaging + reversed Markov to show that m < (d-1)/(64ε) implies Pr[error ≤ ε] < 6/7 under uniform distribution on shattered set.
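The reversed Markov step is the standard one: for a random variable Z ∈ [0,1] with E[Z] ≥ 1/4, bounding E[Z] ≤ 1/8 + (7/8)·Pr[Z > 1/8] gives

$$\Pr[Z>1/8]\;\ge\;\frac{\mathbb{E}[Z]-1/8}{1-1/8}\;\ge\;\frac{1/4-1/8}{7/8}\;=\;\frac{1}{7}.$$

This is where the constants 1/7 and 1/8 come from: moving the threshold moves both constants together, which explains the variation noted in KU₂₀.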

KU₂₀: The exact constant (1/7 vs 1/8 vs 1/4) depends on the proof technique. The factor (d-1) vs d also varies by source.

theorem sample_complexity_upper_of_pac_witness (X : Type u) [MeasurableSpace X] (C : ConceptClass X Bool) (L : BatchLearner X Bool) (mf : ℝ → ℝ → ℕ) (hPAC : ∀ (ε δ : ℝ), 0 < ε → 0 < δ → ∀ (D : MeasureTheory.Measure X), MeasureTheory.IsProbabilityMeasure D → ∀ c ∈ C, (MeasureTheory.Measure.pi fun (x : Fin (mf ε δ)) => D) {xs : Fin (mf ε δ) → X | D {x : X | L.learn (fun (i : Fin (mf ε δ)) => (xs i, c (xs i))) x ≠ c x} ≤ ENNReal.ofReal ε} ≥ ENNReal.ofReal (1 - δ)) (ε δ : ℝ) :
0 < ε → 0 < δ → SampleComplexity X C ε δ ≤ mf ε δ

Any PAC witness (L, mf) gives an upper bound on SampleComplexity: the infimum is at most the witness sample size.

theorem pac_sample_complexity_sandwich (X : Type u) [MeasurableSpace X] [MeasurableSingletonClass X] (C : ConceptClass X Bool) [MeasurableConceptClass X C] :
PACLearnable X C → ∃ (L : BatchLearner X Bool) (mf : ℝ → ℝ → ℕ), (∀ (ε δ : ℝ), 0 < ε → 0 < δ → ∀ (D : MeasureTheory.Measure X), MeasureTheory.IsProbabilityMeasure D → ∀ c ∈ C, (MeasureTheory.Measure.pi fun (x : Fin (mf ε δ)) => D) {xs : Fin (mf ε δ) → X | D {x : X | L.learn (fun (i : Fin (mf ε δ)) => (xs i, c (xs i))) x ≠ c x} ≤ ENNReal.ofReal ε} ≥ ENNReal.ofReal (1 - δ)) ∧ (∀ (ε δ : ℝ), 0 < ε → 0 < δ → SampleComplexity X C ε δ ≤ mf ε δ) ∧ ∀ (d : ℕ), VCDim X C = d → ∀ (ε δ : ℝ), 0 < ε → ε ≤ 1 / 4 → 0 < δ → δ ≤ 1 → δ ≤ 1 / 7 → 1 ≤ d → ⌈(d - 1) / 2⌉₊ ≤ SampleComplexity X C ε δ ∧ ⌈(d - 1) / 2⌉₊ ≤ mf ε δ

Quantitative sample-complexity sandwich attached to any PAC witness. Packages: (1) PAC guarantee, (2) SampleComplexity ≤ mf, (3) NFL/VC lower bound on both SampleComplexity and mf.

Fundamental theorem: finite VC dim ↔ finite compression scheme with side information. Moran-Yehudayoff 2016 (arXiv:1503.06960). Sorry-free via Compression.lean. Γ₇₃ RESOLVED: CompressionSchemeWithInfo parameterized by concept class C. The no-side-info version (Littlestone-Warmuth conjecture) remains open.

Fundamental theorem: Rademacher complexity characterization. BP₅: two asymmetric directions crossing different paradigm joints. Uses uniform vanishing (∃ m₀ ∀ D), which is the textbook-standard form.

theorem fundamental_theorem (X : Type u) [MeasurableSpace X] [MeasurableSingletonClass X] (C : ConceptClass X Bool) [MeasurableConceptClass X C] :
(PACLearnable X C ↔ VCDim X C < ⊤) ∧ (VCDim X C < ⊤ ↔ ∃ (k : ℕ) (cs : CompressionSchemeWithInfo0 X Bool C), CompressionSchemeWithInfo.size cs = k) ∧ (VCDim X C < ⊤ ↔ ∀ ε > 0, ∃ (m₀ : ℕ), ∀ (D : MeasureTheory.Measure X), MeasureTheory.IsProbabilityMeasure D → ∀ m ≥ m₀, RademacherComplexity X C D m < ε) ∧ (PACLearnable X C → ∃ (L : BatchLearner X Bool) (mf : ℝ → ℝ → ℕ), (∀ (ε δ : ℝ), 0 < ε → 0 < δ → ∀ (D : MeasureTheory.Measure X), MeasureTheory.IsProbabilityMeasure D → ∀ c ∈ C, (MeasureTheory.Measure.pi fun (x : Fin (mf ε δ)) => D) {xs : Fin (mf ε δ) → X | D {x : X | L.learn (fun (i : Fin (mf ε δ)) => (xs i, c (xs i))) x ≠ c x} ≤ ENNReal.ofReal ε} ≥ ENNReal.ofReal (1 - δ)) ∧ (∀ (ε δ : ℝ), 0 < ε → 0 < δ → SampleComplexity X C ε δ ≤ mf ε δ) ∧ ∀ (d : ℕ), VCDim X C = d → ∀ (ε δ : ℝ), 0 < ε → ε ≤ 1 / 4 → 0 < δ → δ ≤ 1 → δ ≤ 1 / 7 → 1 ≤ d → ⌈(d - 1) / 2⌉₊ ≤ SampleComplexity X C ε δ ∧ ⌈(d - 1) / 2⌉₊ ≤ mf ε δ) ∧ (VCDim X C < ⊤ ↔ ∃ (d : ℕ), ∀ (m : ℕ), d ≤ m → GrowthFunction X C m ≤ ∑ i ∈ Finset.range (d + 1), m.choose i)

Fundamental theorem of statistical learning (5-way equivalence, BP₅).

A₄ CORRECTION: The original NFL statement ¬ PACLearnable X Set.univ for [Fintype X] is PROVABLY FALSE.

For finite X: VCDim(Set.univ) = Fintype.card X < ⊤, so by vc_characterization
(← direction), Set.univ IS PAC-learnable (with sample complexity O(|X|/ε)).
The learner can take m ≥ |X| samples and memorize the entire domain.
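The memorization learner is easy to write down. A minimal sketch of just the learning map (hypothetical helper, not tied to this file's BatchLearner structure): once every point of X has appeared in the sample, this reproduces the target concept exactly.

```lean
/-- Look the query point up in the sample; answer `false` if unseen.
    Hypothetical sketch, not this file's definition. -/
def memorizeLearn {X : Type*} [DecidableEq X] {m : ℕ}
    (S : Fin m → X × Bool) (x : X) : Bool :=
  ((List.ofFn S).find? (fun p => p.1 == x)).elim false Prod.snd
```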

The correct NFL theorem operates on INFINITE domains where VCDim = ∞.  

NFL for infinite domains: Set.univ has infinite VC dimension.

NFL corollary: Set.univ over infinite domains is not PAC-learnable.

theorem nfl_fixed_sample (X : Type u) [MeasurableSpace X] [Fintype X] [MeasurableSingletonClass X] (hX : 2 ≤ Fintype.card X) (m : ℕ) (hm : 2 * m ≤ Fintype.card X) (L : BatchLearner X Bool) :
∃ (D : MeasureTheory.Measure X), MeasureTheory.IsProbabilityMeasure D ∧ ∃ (c : X → Bool), (MeasureTheory.Measure.pi fun (x : Fin m) => D) {xs : Fin m → X | D {x : X | L.learn (fun (i : Fin m) => (xs i, c (xs i))) x ≠ c x} > ENNReal.ofReal (1 / 8)} > 0

NFL for fixed sample size (Shalev-Shwartz & Ben-David Theorem 5.1): For any fixed sample size m, there exists a distribution such that any learner using m samples fails on SOME concept in Set.univ.

A₄ REPAIR: Original statement used ∃ D, (IsProbabilityMeasure D → ...) which allows D = 0 (zero measure), making the implication vacuously true. Corrected to ∃ D, IsProbabilityMeasure D ∧ ... (bundled conjunction).

Proof route:

  1. Let T ⊂ X with |T| = 2m (exists since 2m ≤ |X|)
  2. D = uniform over T: D = (1/(2m)) · Σ_{x ∈ T} δ_x
  3. For any learner L, average over random labelings c : T → Bool: E_c[TrueError(L(S), c, D)] ≥ 1/4 (the unseen points are random)
  4. By Markov: ∃ c with TrueError > 1/8 with positive probability
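Step 3's constant is a direct count: the sample covers at most m of the 2m points of T, so at least half of D's mass sits on unseen points, and a uniformly random labeling makes any fixed learner wrong on each unseen point with probability 1/2:

$$\mathbb{E}_c[\mathrm{TrueError}]\;\ge\;\frac{1}{2}\cdot\frac{m}{2m}\;=\;\frac{1}{4}.$$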

KU₂₁: Constructing uniform measure on T requires Fintype instance or manual construction via Finset.sum of Dirac measures. UK₁₀: The averaging-over-labelings step is where the counting argument lives — this is combinatorial, not measure-theoretic.

theorem occam_algorithm (X : Type u) [MeasurableSpace X] [MeasurableSingletonClass X] (C : ConceptClass X Bool) (L : BatchLearner X Bool) (dl : DescriptionLength (Concept X Bool)) (k : ℕ) (_hk : ∀ c ∈ C, dl c ≤ k) (hvcdim : VCDim X C < ⊤) [MeasurableConceptClass X C] :
(∀ {m : ℕ} (S : Fin m → X × Bool) (i : Fin m), L.learn S (S i).1 = (S i).2) → PACLearnable X C

Occam's algorithm: any consistent learner with bounded description length is a PAC learner.

Hypotheses (what the theorem NEEDS but the current statement is MISSING):

  1. L is consistent: ∀ S, ∀ i, L.learn S (S i).1 = (S i).2
  2. Description length bound: ∀ c ∈ C, dl c ≤ k (for some k)
  3. Hoeffding: for iid sample of size m ≥ (1/ε)(k·ln 2 + ln(1/δ)), a consistent hypothesis has true error ≤ ε with probability ≥ 1-δ
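The sample size in hypothesis 3 is the standard Occam counting argument: there are at most 2^k hypotheses of description length ≤ k, and a fixed hypothesis with true error > ε stays consistent with m iid examples with probability at most (1 − ε)^m, so by a union bound

$$\Pr[\exists\,h:\ \mathrm{err}_D(h)>\varepsilon\ \text{and}\ h\ \text{consistent with}\ S]\;\le\;2^{k}(1-\varepsilon)^{m}\;\le\;2^{k}e^{-\varepsilon m},$$

which is at most δ exactly when m ≥ (1/ε)(k·ln 2 + ln(1/δ)).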

The current hypothesis only states consistency. The dl parameter is unused. ABD-R: add hypothesis (∀ c ∈ C, dl c ≤ k) and set m accordingly.