Generalization Error, Sample/Query/Label Complexity, ERM #
The numerical quantities that PAC learning theory bounds, together with the canonical PAC learner (ERM).
Sample complexity of PAC learning: the minimum number of samples needed to achieve (ε,δ)-PAC learning. m_C(ε,δ) = sInf { m | ∃ L, ∀ probability measures D, ∀ c ∈ C, D^m{ S : error(L(S)) ≤ ε } ≥ 1-δ }.
Equations
- One or more equations did not get rendered due to their size.
Instances For
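The unrendered equation can be sketched in Lean directly from the m_C formula above; `Learner`, `runLearner`, and `sampleComplexitySketch` are assumed placeholder names, not the file's actual identifiers:

```lean
-- Sketch only: `Learner` and `runLearner` are assumed placeholder names.
noncomputable def sampleComplexitySketch {X : Type*} [MeasurableSpace X]
    (C : Set (Concept X Bool)) (ε δ : ℝ) : ℕ∞ :=
  sInf { m : ℕ∞ | ∃ L : Learner X, ∀ (D : MeasureTheory.Measure X),
    MeasureTheory.IsProbabilityMeasure D → ∀ c ∈ C, ∀ k : ℕ, m ≤ (k : ℕ∞) →
      -- D^k of the success event {S : error(L(S)) ≤ ε} is at least 1 - δ
      (MeasureTheory.Measure.pi fun _ : Fin k => D)
          { xs | TrueError X (runLearner L (fun i => (xs i, c (xs i)))) c D
                   ≤ ENNReal.ofReal ε }
        ≥ ENNReal.ofReal (1 - δ) }
```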
Query complexity: minimum membership queries for exact learning. Formally: sInf { q | ∃ active learner that identifies c using ≤ q queries }. A4 NOTE: This definition is well-typed but CANNOT be computed without a query-counting oracle wrapper (ABD-R deferred). The sInf formulation is the mathematically correct definition even without the infrastructure.
Equations
Instances For
Label complexity: minimum labels for active PAC learning. Formally: sInf { k | ∃ active learner using ≤ k labels achieving PAC(ε,δ) }. A4 NOTE: The oracle model doesn't track label count. ABD-R: add queryCount field or label-tracking wrapper.
Equations
Instances For
Mistake bound: minimum worst-case mistakes for online learning of C.
Equations
- OptimalMistakeBound X C = ⨅ (M : ℕ), ⨅ (_ : MistakeBounded X Bool C M), ↑M
Instances For
Generalization error (true risk): expected loss under distribution D.
Instances For
Empirical error: average loss on a finite sample.
Equations
Instances For
Equations
- ermLearn X Y H loss hne S = if h : ∃ h₀ ∈ H, ∀ h' ∈ H, EmpiricalError X Y h₀ S loss ≤ EmpiricalError X Y h' S loss then h.choose else hne.some
Instances For
Empirical Risk Minimization (ERM): the canonical PAC learner. Selects h ∈ H minimizing EmpiricalError on the sample when a minimizer exists; falls back to an arbitrary h ∈ H otherwise. M-DefinitionRepair: added (hne : H.Nonempty) to resolve Nonempty witness.
Equations
Instances For
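A minimal well-definedness property one would expect of ermLearn, sketched with an illustrative name; both branches of the `dite` in the equation above land in `H`, and the `sorry` marks the `Exists.choose_spec` step:

```lean
-- Sketch: both branches of the `dite` in `ermLearn` return an element of `H`.
theorem ermLearn_mem (X Y : Type*) {m : ℕ} (H : Set (Concept X Y))
    (loss : Y → Y → ℝ) (hne : H.Nonempty) (S : Fin m → X × Y) :
    ermLearn X Y H loss hne S ∈ H := by
  unfold ermLearn
  split
  · sorry  -- `Exists.choose_spec` gives membership of the chosen minimizer
  · exact hne.some_mem
```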
PAC Proof Infrastructure Layer #
The PAC proof (vc_characterization, vcdim_finite_imp_pac) requires three layers of infrastructure sitting BETWEEN the combinatorial side (VCDim, Shatters, GrowthFunction) and the measure-theoretic side (PACLearnable, Measure.pi):
P₁ (combinatorial): VCDim, Shatters, GrowthFunction, Sauer-Shelah
    ↓ [HC > 0 joint — TrueError bridges these]
BRIDGE: TrueError, EmpiricalMeasureError, IsConsistentWith, UniformConvergence
    ↓ [HC > 0 joint — concentration inequalities]
P₂ (measure-theoretic): PACLearnable, Measure.pi, IsProbabilityMeasure
The hidden channel at the first joint: TrueError is a MEASURE (ENNReal) in PACLearnable but GeneralizationError is an INTEGRAL (ℝ). These are not interchangeable without measurability hypotheses. The bridge between them is genuinely at HC > 0.
K4 was originally "Hoeffding blocks PAC proofs." K4 dissolves: Mathlib has Real.one_sub_le_exp_neg and Real.one_sub_div_pow_le_exp_neg. The ACTUAL obstruction is the missing definitions below.
TrueError: The measure-valued error #
PACLearnable uses D { x | h x ≠ c x } (ENNReal), not ∫ loss(h(x), c(x)) dD (ℝ).
This is the 0-1 loss specialized to the realizable case with set-measure semantics.
HC at ENNReal/ℝ joint: GeneralizationError (ℝ-valued integral) and TrueError (ENNReal-valued measure) coincide ONLY when:
- D is a probability measure
- The loss is 0-1
- {x | h x ≠ c x} is measurable
- The integral equals the measure of the disagreement set
Without all four, the bridge is lossy.
Counterdefinition (COUNTER-1): If proofs need ℝ-valued error for real-analysis lemmas (e.g., ε-δ bounds with subtraction), swap to:
TrueErrorReal h c D := (D { x | h x ≠ c x }).toReal
Swap condition: Coh failure at any theorem using sub_lt or abs_sub on errors.
Counterdefinition (COUNTER-2): If agnostic proofs need distribution over X × Y:
AgnosticTrueError h D_XY := D_XY { p | h p.1 ≠ p.2 }
Swap condition: When proving agnostic PAC → realizable PAC direction.
True error (0-1 loss, realizable case): D-probability of disagreement. This is what PACLearnable's success event measures.
Instances For
True error in ℝ: for use in bounds involving subtraction/absolute value. COUNTER-1 of TrueError. The toReal bridge loses information when the measure is ⊤.
Equations
- TrueErrorReal X h c D = (TrueError X h c D).toReal
Instances For
Bridge: TrueError equals GeneralizationError under 0-1 loss when the disagreement set is measurable. This theorem sits at the HC > 0 joint between ENNReal and ℝ error worlds. KU₁: requires MeasurableSet {x | h x ≠ c x} — which needs [DecidableEq Bool] and measurability of h and c. What are the minimal measurability hypotheses? UK₁: For concept classes where membership is undecidable, this bridge may not have a clean computational witness.
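The bridge described above might be stated as follows; the measurability hypothesis is the candidate minimal answer to KU₁, the argument order of `GeneralizationError` is assumed to mirror `EmpiricalError`, and the proof is left as a sketch:

```lean
-- Statement-only sketch; hypothesis and argument order are assumptions.
theorem trueErrorReal_eq_generalizationError {X : Type*} [MeasurableSpace X]
    (h c : Concept X Bool) (D : MeasureTheory.Measure X)
    [MeasureTheory.IsProbabilityMeasure D]
    (hmeas : MeasurableSet { x | h x ≠ c x }) :
    TrueErrorReal X h c D
      = GeneralizationError X Bool h c D (zeroOneLoss Bool) := by
  -- ∫ zeroOneLoss (h x) (c x) dD = (D {x | h x ≠ c x}).toReal
  sorry
```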
EmpiricalMeasureError: ENNReal-valued empirical error #
EmpiricalError (in ℝ) counts training mistakes as an average. But PACLearnable compares TrueError (ENNReal) against ENNReal.ofReal ε. To connect ERM to PACLearnable, we need the empirical analogue in ENNReal.
HC at this joint: The empirical distribution is a discrete measure (a sum of Dirac deltas). The true distribution is an arbitrary probability measure. Uniform convergence is the claim that these converge uniformly over H. The empirical measure IS a Measure — Mathlib has MeasureTheory.Measure.sum and Finset.sum for constructing it.
Empirical measure: the uniform distribution over a finite sample. D̂_S = (1/m) Σᵢ δ_{xᵢ} where S = (x₁,...,xₘ). This is a probability measure when m > 0. UK₂: Is there a natural categorical structure here? The empirical measure is a functor from (Fin m → X) to Measure X. What does this functoriality buy for the proofs?
Equations
- EmpiricalMeasure X xs = if _hm : m = 0 then 0 else (1 / ↑m) • ∑ i : Fin m, MeasureTheory.Measure.dirac (xs i)
Instances For
Empirical 0-1 error as a measure value: D̂_S{x | h x ≠ c x}. For a finite sample, this equals (# mistakes) / m. Connects EmpiricalError (ℝ) to TrueError (ENNReal) via the empirical measure.
Equations
- EmpiricalMeasureError X h c xs = TrueError X h c (EmpiricalMeasure X xs)
Instances For
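A statement-only sketch of the probability-measure claim for EmpiricalMeasure; the `[NeZero m]` spelling of positivity is an assumption:

```lean
-- Statement-only sketch: total mass is (1/m) · Σᵢ δ_{xᵢ}(univ) = (1/m) · m = 1.
theorem empiricalMeasure_isProbability {X : Type*} [MeasurableSpace X]
    {m : ℕ} [NeZero m] (xs : Fin m → X) :
    MeasureTheory.IsProbabilityMeasure (EmpiricalMeasure X xs) := by
  constructor
  -- each Dirac has mass 1; the sum over `Fin m` has mass m; the smul rescales
  sorry
```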
Bridge: EmpiricalMeasureError equals the counting-based EmpiricalError under 0-1 loss (up to ENNReal ↔ ℝ conversion). KU₄: The division by m creates a rational, not necessarily a real. Does ENNReal.ofReal (k/m) = (k : ENNReal) / (m : ENNReal)?
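KU₄ likely dissolves via existing Mathlib lemmas; a sketch, with lemma names as of recent Mathlib (worth re-checking the exact spellings):

```lean
-- Sketch: lemma names as of recent Mathlib; re-check the exact spellings.
example (k m : ℕ) (hm : 0 < m) :
    ENNReal.ofReal ((k : ℝ) / (m : ℝ)) = (k : ENNReal) / (m : ENNReal) := by
  rw [ENNReal.ofReal_div_of_pos (by exact_mod_cast hm),
      ENNReal.ofReal_natCast, ENNReal.ofReal_natCast]
```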
IsConsistentWith: Consistency of hypotheses with samples #
A hypothesis is consistent with a labeled sample if it correctly classifies every point in the sample. This is the realizability assumption at the sample level.
Inv assessment: Robust across PAC (ERM output), Gold (version space membership), and compression (reconstructed hypothesis). Inv = 0.8.
A hypothesis h is consistent with labeled sample S.
Equations
- IsConsistentWith X Y h S = ∀ (i : Fin m), h (S i).1 = (S i).2
Instances For
Consistency implies zero empirical 0-1 error.
A loss function is faithful if: loss(y,y) = 0 and loss(y₁,y₂) = 0 → y₁ = y₂. This ensures that zero empirical error ↔ consistency. A5-valid enrichment (Γ₃₉): adds structure to loss, doesn't simplify theorem.
Matching predictions have zero loss
Zero loss implies matching predictions
Instances For
EmpiricalError with a faithful loss is zero iff consistent.
Concentration Infrastructure (Bridge to Zhang's SubGaussian/EfronStein) #
The concentration inequalities needed for UC come from three layers:
- McDiarmid (bounded differences) → tail bounds for empirical error of a SINGLE h
- Union bound over GrowthFunction-many effective hypotheses
- Sauer-Shelah to control GrowthFunction by VCDim
This section adds the Lean4 infrastructure that BRIDGES our types to the concentration lemma types. The actual concentration lemmas are proved in:
- Zhang's EfronStein.lean: efronStein (variance decomposition)
- Zhang's SubGaussian.lean: subGaussian_tail_bound (exponential concentration)
- Google formal-ml exp_bound.lean: nnreal_exp_bound2 ((1-x)^k ≤ exp(-xk))
A function f : (Fin m → X) → ℝ has bounded differences if changing any one coordinate changes f by at most c. This is the hypothesis for McDiarmid's inequality. Prior art: Zhang's efronStein gives Var(f) ≤ Σ cᵢ² under this condition.
Equations
- HasBoundedDifferences f c = ∀ (xs : Fin m → X) (i : Fin m) (x' : X), |f xs - f (Function.update xs i x')| ≤ c
Instances For
EmpiricalError of a fixed hypothesis h is a bounded-difference function of the sample, with constant 1/m. Changing one sample point changes the average loss by at most 1/m (since each term is in [0,1] for zeroOneLoss).
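A statement-only sketch of this bounded-differences lemma, with hypothesis names chosen for illustration:

```lean
-- Statement-only sketch: changing coordinate i changes exactly one summand
-- of the average, and each `zeroOneLoss` term lies in [0,1].
theorem empiricalError_hasBoundedDifferences {X : Type*} {m : ℕ}
    (h c : Concept X Bool) :
    HasBoundedDifferences
      (fun xs : Fin m → X =>
        EmpiricalError X Bool h (fun i => (xs i, c (xs i))) (zeroOneLoss Bool))
      (1 / m) := by
  sorry
```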
Complement probability bound: if μ(bad) ≤ ofReal δ with 0 < δ ≤ 1 and μ is a probability measure, then μ(good) ≥ ofReal(1-δ) where good = compl(bad).
One-sided Hoeffding: for a fixed h with TrueError(h,c,D) = p > ε, the probability that h is consistent on all m IID samples is ≤ (1-ε)^m.
This is the Google formal-ml route (pac_finite_bound, nnreal_exp_bound2). Each sample point xᵢ ∈ {x | h(x) = c(x)} with probability 1-p ≤ 1-ε. IID: Pr[all m correct] = (1-p)^m ≤ (1-ε)^m.
Mathlib: Real.one_sub_div_pow_le_exp_neg gives (1-ε)^m ≤ exp(-εm).
This is SIMPLER than McDiarmid (one-sided, no expectation computation).
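The (1-ε)^m ≤ exp(-εm) step can be sketched against the Mathlib lemma named above; its exact signature (hypothesis t ≤ n) is assumed here:

```lean
-- Sketch: the exact signature of `Real.one_sub_div_pow_le_exp_neg` is assumed.
example {ε : ℝ} {m : ℕ} (hm : 0 < m) (hε : ε ≤ 1) :
    (1 - ε) ^ m ≤ Real.exp (-(ε * m)) := by
  have hmR : (0 : ℝ) < m := by exact_mod_cast hm
  have hdiv : ε * m / m = ε := by field_simp
  calc (1 - ε) ^ m = (1 - ε * m / m) ^ m := by rw [hdiv]
    _ ≤ Real.exp (-(ε * m)) :=
        Real.one_sub_div_pow_le_exp_neg (by nlinarith)
```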
Sample-dependent covering lemma: for a FIXED sample xs : Fin m → X, the set of "bad" hypotheses in C (those consistent with c on xs but with true error > ε) can be covered by at most GrowthFunction(C,m) representative hypotheses.
The key insight: for a fixed sample xs, the consistency condition ∀ i, h(xs i) = c(xs i) depends only on h's restriction to {xs i | i}. Two hypotheses with the same restriction produce identical consistency predicates on xs. The number of distinct restrictions of C to any m-point set is at most GrowthFunction(C,m) by definition.
This is the SAMPLE-DEPENDENT version (Γ₆₅-fix): representatives are chosen per-sample, which is the standard approach in PAC learning proofs. The previous sample-independent version was unprovable for infinite concept classes.
VCDim < ⊤ → growth function polynomially bounded by partial binomial sum. Forward direction of fundamental_theorem conjunct 5. Uses Sauer-Shelah: GrowthFunction(m) ≤ ∑_{i≤d} C(m,i) where d = VCDim.
UniformConvergence: The bridge from VCDim to PAC #
Uniform convergence is the key property that makes finite VCDim imply PAC learnability. It says: with high probability over an iid sample, the empirical error of EVERY hypothesis in H is close to its true error.
∀ ε > 0, ∀ δ > 0, ∃ m₀, ∀ m ≥ m₀, D^m { xs | ∀ h ∈ H, |TrueError(h) - EmpError(h)| < ε } ≥ 1 - δ
This is STRONGER than PAC learnability (which only needs the ERM hypothesis to be good). Uniform convergence → PAC learnability (via ERM). The converse fails in general (agnostic PAC ≠ uniform convergence).
HC at this joint: The quantifier structure is critical. PACLearnable has ∃ L, ∀ D, ∀ c ∈ C, while UniformConvergence has ∀ D, ∀ h ∈ H. The universal quantifier over h IN the probability event (inside the measure) makes uniform convergence strictly stronger.
Counterdefinition (COUNTER-3): If we need a non-iid variant:
Replace Measure.pi with a general product measure (martingale convergence).
Swap condition: Online-to-batch conversion proofs or non-iid PAC extensions.
KU₈: The definition below uses TrueError (ENNReal) and requires converting to ℝ for the absolute value. Is there a cleaner formulation in pure ENNReal? UK₃: Uniform convergence over UNCOUNTABLE hypothesis classes requires measurability of the supremum — a deep issue in empirical process theory. What is the Lean4 type-theoretic status of this?
Uniform convergence of empirical error to true error over a hypothesis class. This is the property that makes finite VCDim → PAC learnability work. BP₅ connects here: this is ONE of the five characterizations.
M-DefinitionRepair (Γ₃₅ → Γ₄₁): The m₀ must be INDEPENDENT of D and c. That's what "uniform" means — convergence is uniform over all distributions and all target concepts. The original definition had m₀ depending on D and c, making uc_imp_pac unprovable (PACLearnable's mf must be independent of D, c). Repaired: ∃ m₀ is now BEFORE ∀ D, ∀ c. This STRENGTHENS the definition (A5-valid: adds content, doesn't simplify).
Equations
- One or more equations did not get rendered due to their size.
Instances For
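A sketch of the repaired quantifier order, assembled from the docstring and the bound_works field; treat names and spelling as illustrative rather than the file's literal text:

```lean
-- Sketch: `m₀` is fixed BEFORE `D` and `c`, per the M-DefinitionRepair note.
def UniformConvergenceSketch {X : Type*} [MeasurableSpace X]
    (H : Set (Concept X Bool)) : Prop :=
  ∀ ε > 0, ∀ δ > 0, ∃ m₀ : ℕ, ∀ (D : MeasureTheory.Measure X),
    MeasureTheory.IsProbabilityMeasure D → ∀ (c : Concept X Bool), ∀ m ≥ m₀,
      (MeasureTheory.Measure.pi fun _ : Fin m => D)
          { xs : Fin m → X | ∀ h ∈ H,
              |TrueErrorReal X h c D -
                EmpiricalError X Bool h (fun i => (xs i, c (xs i)))
                  (zeroOneLoss Bool)| < ε }
        ≥ ENNReal.ofReal (1 - δ)
```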
Quantitative uniform convergence: with explicit sample complexity bound. m ≥ (8/ε²)(d·ln(2em/d) + ln(4/δ)) suffices for VC classes of dimension d. KU₉: The exact constant depends on the proof technique (symmetrization, chaining, Dudley entropy integral). Which gives the tightest bound?
Sample complexity function
- bound_works (ε δ : ℝ) : 0 < ε → 0 < δ → ∀ (D : MeasureTheory.Measure X), MeasureTheory.IsProbabilityMeasure D → ∀ (c : Concept X Bool) (m : ℕ), self.sampleBound ε δ ≤ m → (MeasureTheory.Measure.pi fun (x : Fin m) => D) {xs : Fin m → X | ∀ h ∈ H, |TrueErrorReal X h c D - EmpiricalError X Bool h (fun (i : Fin m) => (xs i, c (xs i))) (zeroOneLoss Bool)| < ε} ≥ ENNReal.ofReal (1 - δ)
The bound works: m ≥ sampleBound ε δ implies uniform convergence at (ε, δ)
Instances For
Uniform convergence implies PAC learnability via ERM. The ERM learner (which exists by ermLearn) achieves PAC learning when uniform convergence holds. This is the second half of vcdim_finite_imp_pac.
Concentration Inequality Bridge #
Mathlib provides the exponential inequality chain:
- Real.one_sub_le_exp_neg: (1 - x) ≤ exp(-x)
- Real.one_sub_div_pow_le_exp_neg: (1 - t/n)^n ≤ exp(-t)
These dissolve the K4 obstruction. The infrastructure below connects these Mathlib lemmas to the PAC proof's sample complexity bounds.
KU₁₃: Hoeffding's inequality for sums of bounded random variables is NOT in Mathlib as of 2026-03, but measure_sum_ge_le_of_iIndepFun provides the core concentration bound for independent random variables. What is the gap between what Mathlib provides and what we need?
UK₆: The PAC proof's concentration step requires a chain:
- (1) Sauer-Shelah gives the growth function bound
- (2) Union bound over growth-function-many effective hypotheses
- (3) Concentration for each fixed hypothesis (Hoeffding or Chebyshev)
- (4) Combine (2) and (3)
The union bound step (2) is trivial. Step (3) is where Mathlib helps. The question is whether step (3) needs Hoeffding specifically or whether Chebyshev plus the Sauer-Shelah polynomial bound suffices.
Sample complexity for PAC learning with VCDim = d. The standard bound: m ≥ (C/ε)(d · log(1/ε) + log(1/δ)) for a universal constant C. KU₁₄: The exact constant C depends on the proof technique. Symmetrization gives C = 8, chaining gives C = 4. UK₇: Is there a proof-theoretic reason to prefer one constant over another?
Instances For
Gold identification does NOT imply PAC learnability. There exist concept classes that are EX-learnable (identifiable in the limit) but not PAC-learnable (no finite sample suffices for (ε,δ) bounds). Example: the class of Boolean concepts with finite support (true on only finitely many points) is EX-learnable by the "assume false outside the observed support" learner, yet it shatters every finite set, so VCDim = ∞ (hence not PAC-learnable). This is the PAC/Gold paradigm separation — HC > 0 at this joint.
Regret: cumulative excess loss of online learner vs best fixed hypothesis.
Equations
- Regret X Y L H seq T loss = L.cumulativeLoss loss (List.map seq (List.range T)) - sInf ((fun (h : Concept X Y) => fixedHypothesisLoss h loss (List.map seq (List.range T))) '' H)
Instances For
NFL Counting Infrastructure for vcdim_infinite_not_pac #
The core counting argument: for any hypothesis h on a finite set, the average number of disagreements with a uniformly random labeling is exactly half the set size. This implies existence of a labeling with many disagreements, which drives the NFL/PAC lower bound proofs.
For any h : α → Bool on a Fintype, the sum over all functions f : α → Bool of #{x | f x ≠ h x} equals |α| * 2^(|α| - 1). This is the key counting identity for the NFL theorem: each point x contributes 2^(|α|-1) to the sum (exactly half the functions disagree with h at x).
Pigeonhole consequence: for any h on a Fintype with card ≥ 2, there exists a function f disagreeing with h on more than |α|/4 points. This is the per-sample NFL counting lemma.
Markov-type bound on the number of labelings with few disagreements. For any h : α → Bool on a Fintype with |α| ≥ 1: 4 · #{f : #{x | f x ≠ h x} ≤ |α|/4} < 3 · 2^|α|. This is the core of the double-averaging argument for NFL/PAC lower bounds. Proof: Markov's inequality on agreements, using disagreement_sum_eq.
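A statement-only sketch of the counting identity described above, using the text's name disagreement_sum_eq with a suffix to mark it as illustrative:

```lean
-- Statement-only sketch: each x contributes 2^(|α|-1) disagreeing functions.
theorem disagreement_sum_eq_sketch {α : Type*} [Fintype α] [DecidableEq α]
    (h : α → Bool) :
    ∑ f : α → Bool, (Finset.univ.filter fun x => f x ≠ h x).card
      = Fintype.card α * 2 ^ (Fintype.card α - 1) := by
  sorry
```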
Uniform Measure Infrastructure #
For NFL and PAC lower bound proofs, we need uniform probability measures on finite sets. Given a Finset S ⊆ X with |S| > 0, the uniform measure on S is (1/|S|) · Σ_{x ∈ S} δ_x.
This is a special case of EmpiricalMeasure where all sample points are distinct. The key property: IsProbabilityMeasure for the uniform measure on a nonempty finite set.
KU₁₉ (from Google formal-ml): Google uses a bespoke probability_space wrapper. We use Mathlib's MeasureTheory.Measure directly. The uniform measure construction needs Measure.count normalized by Fintype.card, or a manual Dirac sum.
Uniform probability measure on a Fintype: (1/|X|) · count. This gives each point probability 1/|X|. Requires |X| > 0 (nonempty).
Equations
- uniformMeasure X _hne = (1 / ↑(Fintype.card X)) • MeasureTheory.Measure.count
Instances For
The uniform measure is a probability measure when X is nonempty and finite.
NFL core: for any learner and any finite domain, there exists a hard distribution and concept. Factors through uniformMeasure construction. This is the core argument used by both nfl_fixed_sample and pac_lower_bound.
PAC lower bound core: sample complexity is at least (d-1)/2. For any PAC learner with VCDim = d, at least ⌈(d-1)/2⌉ samples needed. Proof: construct d shattered points, uniform distribution, counting argument. Note: the tight constant is (d-1)/(2ε) (EHKV 1989); see EHKV.lean.
Pigeonhole core: compress is injective on C-realizable labelings. If two C-realizable samples over the same points with different labelings produce the same compressed set, correctness forces the labelings to agree. Γ₇₃: now requires realizability hypotheses for both samples.
∃ compression scheme → VCDim < ⊤. Pigeonhole: compress is injective on C-realizable labelings (by correctness), but compressed subsets of an n-point sample are bounded. Shatters X C T guarantees ALL labelings are C-realizable, so injectivity holds on all 2^n labelings. Contradiction for large n. Γ₇₃ RESOLVED: CompressionScheme parameterized by C with realizability guard. Shattered sets guarantee C-realizability of every labeling, so the pigeonhole argument is genuinely non-vacuous.
Growth function polynomially bounded → VCDim < ⊤. Reverse direction: if GrowthFunction m ≤ ∑_{i≤d} C(m,i) for all m ≥ d, then VCDim ≤ d (otherwise GrowthFunction = 2^m for m = VCDim > d).
PAC lower bound membership: if m achieves PAC for C with VCDim = d, then m ≥ ⌈(d-1)/(64ε)⌉. This is the core adversarial counting argument factored for PAC.lean assembly. Note: the tight constant is (d-1)/(2ε) (EHKV 1989); see EHKV.lean.
Proof route (double-averaging on shattered set):
- VCDim = d → ∃ shattered S with |S| = d
- D = uniform on S (probability measure, each point has weight 1/d)
- m < ⌈(d-1)/(64ε)⌉ → 2m < d → NFL counting applies
- Double-averaging over 2^d labelings: E_f[E_xs[error]] ≥ (d-m)/(2d) > 1/4
- Reversed Markov: ∃ c₀ ∈ C with Pr[error ≤ 1/8] ≤ 6/7
- For ε ≤ 1/8: Pr[error ≤ ε] ≤ 6/7 = 1 - 1/7, contradicting PAC
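The double-averaging and reversed-Markov arithmetic in the route above, written out (a sketch; err abbreviates the learner's error on the shattered set):

```latex
% Double-averaging (uses 2m < d):
\mathbb{E}_{f}\bigl[\mathbb{E}_{xs}[\mathrm{err}]\bigr]
  \;\ge\; \frac{d-m}{2d} \;>\; \frac{d-d/2}{2d} \;=\; \frac{1}{4}
% Reversed Markov for 0 \le \mathrm{err} \le 1, threshold 1/8:
\Pr\!\left[\mathrm{err} > \tfrac{1}{8}\right]
  \;\ge\; \frac{1/4 - 1/8}{1 - 1/8} \;=\; \frac{1}{7}
  \quad\Longrightarrow\quad
  \Pr\!\left[\mathrm{err} \le \tfrac{1}{8}\right] \;\le\; \frac{6}{7}
```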
Extract block j from a flat array of k*m elements, using finProdFinEquiv.
Equations
- block_extract k m S j i = S (finProdFinEquiv (j, i))
Instances For
Block index sets are disjoint for distinct blocks.
Block extraction is measurable: extracting block j from a pi-type is measurable.
Block extractions are independent under the product measure. Key infrastructure for boosting (D4) and probability amplification.
Shattering lifting: if T is shattered by C and f : ↥T → Bool, then there exists c ∈ C that agrees with f on all of T.
Key counting lemma: for any h : ↥T → Bool on a shattered set T with |T| ≥ 2, there exists c ∈ C with #{x ∈ T | c x ≠ h x} > |T|/4. Lifts exists_many_disagreements through shattering.
NFL per-sample lemma for shattered sets: for ANY fixed sample xs and ANY hypothesis h, there exists c ∈ C agreeing with h on sample points but with high error (> 1/4) on the shattered set T. Uses the counting argument on unseen points via disagreement_sum_eq.
If VCDim = ⊤, then C is not PAC learnable. Proof: for any learner L with sample function mf, pick ε = 1/4, δ = 1/4. Let m = mf(1/4, 1/4). Since VCDim = ⊤, ∃ shattered set S with |S| ≥ 2m. Put D = uniform on S. For random labeling, any m-sample learner has expected error ≥ 1/4 on unseen points. This is the core of pac_imp_vcdim_finite (contrapositive direction).