Formal Learning Theory Kernel: Blueprint v1

5.3.2 Behaviorally Correct Learning

Definition 5.7 Behaviorally Correct Learning (\(\mathbf{BC}\))

A learner \(M\) \(\mathbf{BC}\)-identifies a language \(L\) from text if, for every text \(t\) for \(L\), there exists a time \(n_0\) such that \(W_{M(t_0, \ldots , t_n)} = L\) for all \(n \geq n_0\). That is, from time \(n_0\) onward, every hypothesis the learner outputs is extensionally correct (computes the right language), but the indices may keep changing.

The distinction between \(\mathbf{Ex}\) and \(\mathbf{BC}\) is subtle but consequential. \(\mathbf{Ex}\) requires syntactic convergence: the index stabilizes. \(\mathbf{BC}\) requires only semantic convergence: the computed language stabilizes, even if the learner keeps switching between different programs that all compute the same language. At first glance, this seems like a minor relaxation. It is not.
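The contrast can be made concrete with a small illustrative sketch (Python; the names and the finite-domain check are assumptions of the sketch, since true extensional equality \(W_p = W_q\) is undecidable). Two syntactically distinct programs for the language of even numbers stand in for two indices with the same extension; a \(\mathbf{BC}\)-style learner may alternate between them forever.

```python
def lang_even_v1(x):            # "index 0": evens, computed one way
    return x % 2 == 0

def lang_even_v2(x):            # "index 1": evens, computed another way
    return (x // 2) * 2 == x

# A BC-style learner may vacillate between the two indices forever.
hypotheses = [lang_even_v1 if n % 2 == 0 else lang_even_v2 for n in range(10)]

def same_extension(p, q, bound=100):
    """Extensional equality checked on a finite domain -- a decidable
    stand-in for the undecidable test W_p = W_q."""
    return all(p(x) == q(x) for x in range(bound))

syntactic = all(h is hypotheses[0] for h in hypotheses)               # Ex-style
semantic = all(same_extension(h, hypotheses[0]) for h in hypotheses)  # BC-style
print(syntactic, semantic)  # indices vacillate, extensions agree
```

Here syntactic convergence fails while semantic convergence holds: exactly the gap between \(\mathbf{Ex}\) and \(\mathbf{BC}\).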

Theorem 5.8 Case–Smith (1983)

\(\mathbf{BC}\) is strictly more powerful than \(\mathbf{Ex}\): there exists a class of languages that is \(\mathbf{BC}\)-identifiable but not \(\mathbf{Ex}\)-identifiable from text.

Proof

We construct a class \(\mathcal{L}\) that separates \(\mathbf{BC}\) from \(\mathbf{Ex}\).

For each total recursive function \(f : \mathbb {N}\to \mathbb {N}\), define the language \(L_f = \{ \langle n, f(n)\rangle : n \in \mathbb {N}\} \), where \(\langle \cdot , \cdot \rangle \) is a computable pairing function. Let \(\mathcal{L} = \{ L_f : f \text{ is total recursive}\} \).

\(\mathcal{L}\) is \(\mathbf{BC}\)-identifiable. Given a text \(t\) for some \(L_f\), at time \(n\) the learner has seen finitely many pairs \(\langle n_i, m_i\rangle \). Define the (total) function \(g_n\) by \(g_n(n_i) = m_i\) for every pair seen so far and \(g_n(k) = 0\) for all other \(k\), and output an index for the language \(L_{g_n} = \{ \langle k, g_n(k)\rangle : k \in \mathbb {N}\} \). Once all pairs \(\langle k, f(k)\rangle \) with \(k \leq K\) have appeared, the hypothesis computes \(L_f\) correctly on \(\{ 0, \ldots , K\} \); since \(t\) is a text for \(L_f\), this happens for every \(K\) at some finite time. As \(n \to \infty \), \(g_n\) converges pointwise to \(f\): because \(f\) is total, for every \(k\) there exists a time after which \(g_n(k) = f(k)\). The language computed by the hypothesis is eventually \(L_f\), though the index keeps changing as new pairs are absorbed. This is \(\mathbf{BC}\)-identification.
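The learner from this half of the proof can be sketched concretely (Python; the particular target \(f\) and the list encoding of the text are assumptions of the sketch): at time \(n\) it conjectures the graph of \(g_n\), which agrees with \(f\) on the arguments presented so far and defaults to \(0\) elsewhere.

```python
def f(k):
    # Some fixed total recursive target function (illustrative choice).
    return k * k

def learner(prefix):
    """prefix: list of pairs (k, f(k)) seen so far. Returns g_n, which
    is f on the seen arguments and 0 off the seen domain."""
    seen = dict(prefix)
    return lambda k: seen.get(k, 0)

text = [(k, f(k)) for k in range(20)]   # an initial segment of a text for L_f
g = learner(text[:10])                  # hypothesis at time n = 10

# g_n agrees with f exactly on the arguments presented so far ...
assert all(g(k) == f(k) for k in range(10))
# ... and defaults to 0 beyond them, so the hypothesis keeps changing
# (pointwise convergence to f) as further pairs are absorbed.
assert g(15) == 0 and f(15) == 225
```

Each new pair forces a new index, which is exactly why this learner achieves semantic but not syntactic convergence.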

\(\mathcal{L}\) is not \(\mathbf{Ex}\)-identifiable. Suppose for contradiction that a learner \(M\) \(\mathbf{Ex}\)-identifies \(\mathcal{L}\). We diagonalize against \(M\). Define a total recursive function \(f\) as follows.

Begin presenting pairs \(\langle 0, 0\rangle , \langle 1, 0\rangle , \langle 2, 0\rangle , \ldots \) (corresponding to the zero function). Wait until \(M\) converges to some index \(e\): since the zero function is total recursive, \(L_{\mathbf{0}} \in \mathcal{L}\), so \(M\) must converge on this text, and then \(W_e = L_{\mathbf{0}}\) (the language of the zero function).

Modify \(f\): set \(f(k) = 1\) for some large \(k\) not yet presented. This changes the target to a different total recursive function, while the text presented so far remains consistent with both targets. The index \(e\) to which \(M\) has converged computes \(L_{\mathbf{0}} \neq L_f\), so after seeing \(\langle k, 1\rangle \) the learner must eventually abandon \(e\). Repeating the construction each time \(M\) settles on an index forces infinitely many mind changes; a careful effectivization of this argument (choosing each \(k\) computably from \(M\)'s behavior) yields a total recursive \(f\), hence \(L_f \in \mathcal{L}\), on which \(M\) never converges. This contradicts \(M\) \(\mathbf{Ex}\)-identifying all of \(\mathcal{L}\).
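One round of the diagonalization can be replayed as a toy sketch (Python; the "learner" here is a hypothetical stand-in with a fixed convergence behavior, not a real \(\mathbf{Ex}\)-learner): present pairs \(\langle n, 0\rangle \) until the guessed index stabilizes, then flip the target at a fresh argument.

```python
def mock_learner(prefix):
    """Hypothetical learner: outputs index 'e0' once it has seen at
    least 5 pairs, and a changing guess before that."""
    return "e0" if len(prefix) >= 5 else f"guess{len(prefix)}"

prefix, n = [], 0
while True:                      # present the zero function <n, 0>
    prefix.append((n, 0))
    n += 1
    # Detect (apparent) convergence: same index on two successive prefixes.
    if n >= 2 and mock_learner(prefix) == mock_learner(prefix[:-1]):
        break

k = n + 1                        # a fresh argument not yet presented
target_change = (k, 1)           # redefine f(k) = 1: the text so far is
                                 # consistent with both the old and new target
print(mock_learner(prefix), target_change)
```

The committed index still describes the zero function, yet the presented prefix is also a prefix of a text for the flipped target; iterating this flip is what drives the contradiction.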

The essential point is that \(\mathbf{BC}\)-identification does not require the index to stabilize, only the extension. The class \(\mathcal{L}\) exploits this: the learner must continually update its program as new pairs arrive, and the programs keep changing, but they all eventually compute the same language. \(\mathbf{Ex}\)-identification cannot tolerate this perpetual updating of indices.

\begin{tikzpicture} [
    criterion/.style={draw, thick, rounded corners, minimum width=2.4cm, minimum height=0.9cm, font=\small\bfseries, align=center},
    branch/.style={draw, rounded corners, minimum width=2cm, minimum height=0.7cm, font=\small, align=center},
    strict/.style={-{Stealth[length=2mm]}, thick, sepgreen!70!black},
    weak/.style={-{Stealth[length=2mm]}, thin, dashed, notegray},
    witlabel/.style={font=\tiny, text=edgeorange, fill=white, inner sep=1.5pt, align=center},
    node distance=1.2cm
]

% Main chain
\node[criterion, fill=defblue!15] (bc) {$\BC$};
\node[criterion, fill=defblue!15, below=2cm of bc] (ex) {$\Ex$};
\node[criterion, fill=defblue!15, below=2cm of ex] (fin) {$\FIN$};

% Strict inclusions
\draw[strict] (fin) -- node[witlabel, right, xshift=3pt] {all finite langs\\are $\Ex$ not $\FIN$} (ex);
\draw[strict] (ex) -- node[witlabel, right, xshift=3pt] {Case--Smith 1983\\(Thm~\ref{thm:bc-strictly-stronger})} (bc);

% Branches off Ex
\node[branch, fill=edgeorange!8, right=2.5cm of ex] (anom) {Anomalous\\$\Ex^*$};
\node[branch, fill=edgeorange!8, left=2.5cm of ex] (mono) {Monotonic\\$\mathrm{Mon}\Ex$};
\node[branch, fill=goldcolor!10, below right=1cm and 2.5cm of ex] (vac) {Vacillatory\\$\mathrm{Vex}$};
\node[branch, fill=goldcolor!10, above right=0.5cm and 2.5cm of bc] (trial) {Trial \&\\Error};

% Relations
\draw[strict] (ex) -- node[witlabel, above] {\rel{extends}} (anom);
\draw[weak] (mono) -- node[witlabel, above] {\rel{restricts}} (ex);
\draw[strict] (ex) -- node[witlabel, below right, xshift=-8pt] {\rel{extends}} (vac);
\draw[weak] (bc) -- node[witlabel, above right, xshift=-5pt] {$\Delta^0_2$} (trial);

\end{tikzpicture}
Figure 5.1 The identification hierarchy. Solid arrows indicate strict inclusion of the collections of identifiable classes (the target criterion identifies strictly more classes): \(\mathbf{FIN}\subsetneq \mathbf{Ex}\subsetneq \mathbf{BC}\), with a separating witness on each edge. Anomalous learning (\(\mathbf{Ex}^*\)) extends \(\mathbf{Ex}\) by allowing finitely many errors (anomalies) in the final hypothesis. Monotonic learning restricts \(\mathbf{Ex}\) by forbidding hypothesis retraction. Vacillatory learning sits between \(\mathbf{Ex}\) and \(\mathbf{BC}\).

Path: \(\textsf{\small fin\_learning} \xrightarrow {\texttt{\small strictly\_stronger}} \textsf{\small ex\_learning} \xrightarrow {\texttt{\small strictly\_stronger}} \textsf{\small bc\_learning}\).
Each arrow is witnessed by an explicit separation. The graph encodes these as \texttt{strictly\_stronger} edges with the separation witness as metadata. The hierarchy is not a sequence of increasingly liberal definitions; it is a chain of theorems, each proved by constructing a class that one criterion can learn and the other cannot.
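One possible encoding of these witnessed edges is sketched below (Python; the field names and the reachability helper are illustrative assumptions, not the kernel's actual schema).

```python
# Edges of the identification hierarchy, each carrying its separation
# witness as metadata (illustrative schema).
hierarchy = [
    {"from": "fin_learning", "to": "ex_learning",
     "relation": "strictly_stronger",
     "witness": "the class of all finite languages is Ex- but not FIN-identifiable"},
    {"from": "ex_learning", "to": "bc_learning",
     "relation": "strictly_stronger",
     "witness": "Case-Smith 1983 separation (Theorem 5.8)"},
]

def reachable(graph, src, dst):
    """Is dst reachable from src along strictly_stronger edges?"""
    frontier = {src}
    while frontier:
        nxt = {e["to"] for e in graph
               if e["from"] in frontier and e["relation"] == "strictly_stronger"}
        if dst in nxt:
            return True
        if nxt <= frontier:      # no new criteria discovered
            return False
        frontier |= nxt
    return False

print(reachable(hierarchy, "fin_learning", "bc_learning"))  # True
```

Because each edge records its witness, the transitive chain \(\mathbf{FIN}\subsetneq \mathbf{Ex}\subsetneq \mathbf{BC}\) can be recovered from the graph together with the theorems that justify it.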