Extended Theorems #
Universal trichotomy, computational hardness, advice reduction, meta-PAC bound, and separation results for compression and SQ dimension.
Advice Elimination Infrastructure #
Joint measurability of a sample-dependent advice learner's evaluation map.
Equations
- AdviceEvalMeasurable LA = ∀ (a : A) (m : ℕ), Measurable fun (p : (Fin m → X × Bool) × X) => LA.learnWithAdvice a p.1 p.2
Instances For
PAC learnability with finite advice, plus measurability for holdout validation.
Equations
- One or more equations did not get rendered due to their size.
Instances For
Choose the advice value with minimum validation empirical error.
Equations
- bestAdvice cand Sval = Classical.choose ⋯
Instances For
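The elided `Classical.choose ⋯` extracts a minimizer of validation error over the finite advice set. A minimal self-contained sketch of the existence fact it relies on (here `valErr` stands in for the validation empirical error; the lemma name `Finite.exists_min` is Mathlib's, but its use here is a reconstruction, not the source's actual proof):

```lean
import Mathlib

-- Sketch: over a finite nonempty advice type, a validation-error minimizer
-- always exists; `bestAdvice` chooses such a witness via `Classical.choose`.
example {A : Type*} [Fintype A] [Nonempty A] (valErr : A → ℝ) :
    ∃ a₀ : A, ∀ a : A, valErr a₀ ≤ valErr a :=
  Finite.exists_min valErr
```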
The advice-elimination learner applied to a labeled sample.
Equations
- adviceSelectedHypothesis LA S = (fun (a : A) => LA.learnWithAdvice a (adviceTrainSample S)) (bestAdvice (fun (a : A) => LA.learnWithAdvice a (adviceTrainSample S)) (adviceValSample S))
Instances For
Advice elimination (Ben-David & Dichterman 1998): If C is PAC-learnable with concept-dependent advice from a FINITE set A (with measurability regularity), then C is PAC-learnable without advice.
Proof strategy: run the advice-augmented learner with each a ∈ A on a training portion of the sample, producing |A| candidate hypotheses, then use a validation portion to select the candidate with lowest empirical error. A union bound over the |A| advice values, combined with Hoeffding's inequality on the validation portion, controls the total failure probability. Sample complexity: O(m_orig(ε/2, δ/(2|A|)) + log(|A|/δ)/ε²).
The [Fintype A] constraint is essential: for infinite A, the theorem is false (no finite union bound). [Nonempty A] ensures the advice space is inhabited.
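The validation-phase arithmetic behind the stated sample complexity can be sketched as follows (the constants are illustrative choices, not necessarily the formalization's). Hoeffding's inequality plus a union bound over the |A| candidates gives

$$\Pr\Bigl[\exists a \in A :\ \bigl|\widehat{\operatorname{err}}(h_a) - \operatorname{err}(h_a)\bigr| > \tfrac{\varepsilon}{4}\Bigr] \;\le\; 2|A|\exp\!\bigl(-2\,m_{\mathrm{val}}\,(\varepsilon/4)^2\bigr) \;\le\; \tfrac{\delta}{2}$$

once $m_{\mathrm{val}} \ge 8\ln(4|A|/\delta)/\varepsilon^2$, which accounts for the $\log(|A|/\delta)/\varepsilon^2$ term; the training portion contributes the $m_{\mathrm{orig}}(\varepsilon/2,\ \delta/(2|A|))$ term.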
Meta-PAC bound: after seeing enough tasks, the meta-learner's output learner generalizes to new tasks from the same environment. The meta-learner's sample complexity over tasks is bounded by a function of ε, δ, and the complexity of the task environment.
Multi-Task Meta-Learning Infrastructure #
A task environment: a finite collection of concept classes (tasks) that a meta-learner is trained on. Each task is a concept class over the same domain X.
This is the formalization of Baxter (2000)'s "learning environment." In the full theory, tasks are drawn i.i.d. from a distribution over concept classes; here we use a finite deterministic collection as the base case.
- numTasks : ℕ
Number of training tasks
- tasks : Fin self.numTasks → ConceptClass X Bool
The concept classes for each task
Instances For
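The fields above correspond to a structure of roughly this shape (a sketch reconstructed from the listed fields; the definition of `ConceptClass` comes from the surrounding development and is assumed here):

```lean
-- Sketch from the documented fields, not the source's verbatim declaration.
structure TaskEnvironment (X : Type*) where
  /-- Number of training tasks -/
  numTasks : ℕ
  /-- The concept classes for each task -/
  tasks : Fin numTasks → ConceptClass X Bool
```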
A meta-learner with PAC guarantees: given a task environment (training tasks), produces a BatchLearner and sample complexity function for new tasks.
Compared to MetaLearner (in Active.lean), this structure:
- takes a TaskEnvironment (multiple training tasks) rather than a single ConceptClass
- exposes the sample complexity function (not just the learner)
- is designed for quantitative PAC bounds, not just learnability
The key question: does seeing n training tasks reduce the per-task sample complexity on new tasks? Baxter (2000) shows the answer is yes under task similarity, but the NFL lower bound still applies per-task.
- learn : TaskEnvironment X → BatchLearner X Bool
Given training tasks, produce a learner for new tasks
- sampleComplexity : TaskEnvironment X → ℝ → ℝ → ℕ
Given training tasks, produce a sample complexity function
Instances For
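Read together, the two fields give a structure of roughly this shape (a sketch from the documented fields; `BatchLearner` and `TaskEnvironment` are from the surrounding development):

```lean
-- Sketch from the documented fields, not the source's verbatim declaration.
structure MetaLearnerPAC (X : Type*) where
  /-- Given training tasks, produce a learner for new tasks -/
  learn : TaskEnvironment X → BatchLearner X Bool
  /-- Given training tasks, produce a sample complexity function -/
  sampleComplexity : TaskEnvironment X → ℝ → ℝ → ℕ
```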
A task sample environment: n training tasks, each with m samples. The meta-learner observes labeled samples from each task and must produce a learner for a new (unseen) task.
This extends TaskEnvironment by specifying sample sizes and the actual samples drawn. The meta-learner's output may depend on the samples but not on the true concepts.
- numTasks : ℕ
Number of training tasks
- samplesPerTask : ℕ
Samples per task
- taskClasses : Fin self.numTasks → ConceptClass X Bool
The concept classes (one per task)
The true concepts (one per task, each in its class)
Each true concept is in its class
Instances For
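The documented fields suggest a structure of roughly this shape. The rendered docs omit the names of the last two fields, so `trueConcepts` and `trueConcepts_mem` below are hypothetical placeholders, as is the structure name:

```lean
-- Sketch only; the last two field names are hypothetical, and `∈` assumes
-- `ConceptClass X Bool` carries a set-membership instance.
structure TaskSampleEnvironment (X : Type*) where
  /-- Number of training tasks -/
  numTasks : ℕ
  /-- Samples per task -/
  samplesPerTask : ℕ
  /-- The concept classes (one per task) -/
  taskClasses : Fin numTasks → ConceptClass X Bool
  /-- The true concepts (one per task, each in its class) -/
  trueConcepts : Fin numTasks → X → Bool
  /-- Each true concept is in its class -/
  trueConcepts_mem : ∀ i, trueConcepts i ∈ taskClasses i
```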
A sample-based meta-learner: sees labeled samples from n training tasks, produces a BatchLearner for new tasks. Unlike MetaLearnerPAC (which takes a TaskEnvironment directly), this meta-learner only sees the data, not the concept classes.
Given n × m labeled samples, produce a learner
Given n × m, produce sample complexity for the new task
Instances For
Baxter base case: any meta-learner's output is subject to the NFL lower bound. Even after seeing arbitrarily many training tasks, the meta-learner's output learner on a NEW task C_new with VCDim = d requires at least ⌈(d-1)/2⌉ samples.
This is the n=1 (single environment) base case of Baxter (2000). The full Baxter bound (n environments, per-task m ≥ d/(ε²·n)) requires multi-environment product measure infrastructure not yet built.
Proof: the meta-learner produces a BatchLearner L and sample complexity mf. If (L, mf) achieves PAC on C_new, then mf ε δ is a PAC-valid sample size, so pac_lower_bound_member gives ⌈(d-1)/2⌉ ≤ mf ε δ.
TODO: Strengthen to full Baxter bound with n training tasks giving per-task improvement m ≥ Ω(d/(ε²·n)). Requires TaskEnvironment distribution and multi-task product measure infrastructure.
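The one-line argument described above can be sketched roughly as follows; every name here (`IsPAC`, `VCDim`, `pac_lower_bound_member`, and the argument order) is a reconstruction from the prose, not the source's actual signature, hence the commented form:

```lean
-- Hypothetical skeleton: the meta-learner's output (L, mf) on the new task
-- is itself a PAC witness, so the single-task lower bound applies verbatim.
--
-- example (ML : MetaLearnerPAC X) (env : TaskEnvironment X)
--     (Cnew : ConceptClass X Bool) (hd : VCDim Cnew = d)
--     (hpac : IsPAC Cnew (ML.learn env) (ML.sampleComplexity env)) :
--     ⌈((d : ℝ) - 1) / 2⌉₊ ≤ ML.sampleComplexity env ε δ :=
--   pac_lower_bound_member hd hpac
```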
Baxter's multi-task lower bound: any sample-based meta-learner that achieves PAC on a new task C_new with VCDim = d, after seeing n training tasks with m samples each, requires ⌈(d-1)/2⌉ samples for the new task.
This is the n-independent version. The n-dependent improvement m ≥ Ω(d/(ε²·n)) requires the product-measure information-theoretic argument.
Key insight: the meta-learner's output (L, mf) is a PAC witness for C_new. By pac_lower_bound_member, any PAC witness requires at least ⌈(d-1)/2⌉ samples. The meta-learner's training phase (seeing n tasks) cannot reduce this bound because the new task's concept class is adversarially chosen AFTER training.
The n-dependent improvement (Baxter 2000, Theorem 3): For the PRODUCT measure over n tasks × m samples per task, the adversary argument gives m ≥ Ω(d/(ε²·n)). This requires:
- TaskDistribution: a measure over concept classes
- Product measure: D^(n×m) decomposed as (D^m)^n
- Information-theoretic counting: n·m bits vs 2^d labelings

These are future infrastructure targets. The current theorem proves the n-INDEPENDENT base case, which is already non-trivial.
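The counting heuristic behind the target bound, stated loosely (this is a back-of-the-envelope rearrangement, not the formal product-measure argument): the n·m total samples must pin down one of 2^{Θ(d)} labelings to accuracy ε, so

$$n \cdot m \;\gtrsim\; \frac{d}{\varepsilon^{2}} \quad\Longrightarrow\quad m \;\gtrsim\; \frac{d}{\varepsilon^{2}\, n},$$

which is exactly the per-task improvement the n-dependent bound would deliver.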
VC dimension does not determine SQ hardness:
there exists a concept class with finite VC dimension but infinite SQ dimension
under some distribution at some positive correlation threshold.
Witness: singleton indicators on ℕ, with SQDimension = ⊤ at τ = 1.
For any probability measure D on ℕ, the correlation between distinct indicators 1_i, 1_j is 1 - 2(D({i}) + D({j})), whose absolute value is at most 1, so every finite subset of C qualifies at τ = 1.
Since C is infinite, SQDimension = ⊤.
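The calculation behind this witness is short. In the ±1 encoding $f_i = 2\cdot\mathbf{1}_{\{i\}} - 1$, for $i \ne j$ the product $f_i f_j$ equals $-1$ exactly on $\{i, j\}$ and $+1$ elsewhere, so

$$\mathbb{E}_{x \sim D}\bigl[f_i(x) f_j(x)\bigr] \;=\; \bigl(1 - D(\{i\}) - D(\{j\})\bigr) - \bigl(D(\{i\}) + D(\{j\})\bigr) \;=\; 1 - 2\bigl(D(\{i\}) + D(\{j\})\bigr),$$

and since $D(\{i\}) + D(\{j\}) \in [0, 1]$, the absolute value is at most 1; every finite subset therefore clears the threshold τ = 1.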