Formal Learning Theory Kernel: Blueprint v1

5.3.2 Behaviorally Correct Learning

Definition 5.7 Behaviorally Correct Learning (\(\mathbf{BC}\))

A learner \(M\) \(\mathbf{BC}\)-identifies a language \(L\) from text if, for every text \(t\) for \(L\), there exists a time \(n_0\) such that \(W_{M(t_0, \ldots , t_n)} = L\) for all \(n \geq n_0\). That is, from time \(n_0\) onward, every hypothesis the learner outputs is extensionally correct (computes the right language), but the indices may keep changing.

The distinction between \(\mathbf{Ex}\) and \(\mathbf{BC}\) is subtle but consequential. \(\mathbf{Ex}\) requires syntactic convergence: the index stabilizes. \(\mathbf{BC}\) requires only semantic convergence: the computed language stabilizes, even if the learner keeps switching between different programs that all compute the same language. At first glance, this seems like a minor relaxation. It is not.
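The contrast can be made concrete with a small illustrative sketch (Python; the names and the finite-domain check are assumptions of the sketch, since true extensional equality \(W_p = W_q\) is undecidable). Two syntactically distinct programs for the language of even numbers stand in for two indices with the same extension; a \(\mathbf{BC}\)-style learner may alternate between them forever.

```python
def lang_even_v1(x):            # "index 0": evens, computed one way
    return x % 2 == 0

def lang_even_v2(x):            # "index 1": evens, computed another way
    return (x // 2) * 2 == x

# A BC-style learner may vacillate between the two indices forever.
hypotheses = [lang_even_v1 if n % 2 == 0 else lang_even_v2 for n in range(10)]

def same_extension(p, q, bound=100):
    """Extensional equality checked on a finite domain -- a decidable
    stand-in for the undecidable test W_p = W_q."""
    return all(p(x) == q(x) for x in range(bound))

syntactic = all(h is hypotheses[0] for h in hypotheses)               # Ex-style
semantic = all(same_extension(h, hypotheses[0]) for h in hypotheses)  # BC-style
print(syntactic, semantic)  # indices vacillate, extensions agree
```

Here syntactic convergence fails while semantic convergence holds: exactly the gap between \(\mathbf{Ex}\) and \(\mathbf{BC}\).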

Theorem 5.8 Case–Smith (1983)

\(\mathbf{BC}\) is strictly more powerful than \(\mathbf{Ex}\): there exists a class of languages that is \(\mathbf{BC}\)-identifiable but not \(\mathbf{Ex}\)-identifiable from text.

Proof

We construct a class \(\mathcal{L}\) that separates \(\mathbf{BC}\) from \(\mathbf{Ex}\).

For each total recursive function \(f : \mathbb {N}\to \mathbb {N}\), define the language \(L_f = \{ \langle n, f(n)\rangle : n \in \mathbb {N}\} \), where \(\langle \cdot , \cdot \rangle \) is a computable pairing function. Let \(\mathcal{L} = \{ L_f : f \text{ is total recursive}\} \).

\(\mathcal{L}\) is \(\mathbf{BC}\)-identifiable. Given a text \(t\) for some \(L_f\), at time \(n\) the learner has seen finitely many pairs \(\langle n_i, m_i\rangle \). Define the (total) function \(g_n\) by \(g_n(n_i) = m_i\) for every pair seen so far and \(g_n(k) = 0\) for all other \(k\), and output an index for the language \(L_{g_n} = \{ \langle k, g_n(k)\rangle : k \in \mathbb {N}\} \). Once all pairs \(\langle k, f(k)\rangle \) with \(k \leq K\) have appeared, the hypothesis computes \(L_f\) correctly on \(\{ 0, \ldots , K\} \); since \(t\) is a text for \(L_f\), this happens for every \(K\) at some finite time. As \(n \to \infty \), \(g_n\) converges pointwise to \(f\): because \(f\) is total, for every \(k\) there exists a time after which \(g_n(k) = f(k)\). The language computed by the hypothesis is eventually \(L_f\), though the index keeps changing as new pairs are absorbed. This is \(\mathbf{BC}\)-identification.
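The learner from this half of the proof can be sketched concretely (Python; the particular target \(f\) and the list encoding of the text are assumptions of the sketch): at time \(n\) it conjectures the graph of \(g_n\), which agrees with \(f\) on the arguments presented so far and defaults to \(0\) elsewhere.

```python
def f(k):
    # Some fixed total recursive target function (illustrative choice).
    return k * k

def learner(prefix):
    """prefix: list of pairs (k, f(k)) seen so far. Returns g_n, which
    is f on the seen arguments and 0 off the seen domain."""
    seen = dict(prefix)
    return lambda k: seen.get(k, 0)

text = [(k, f(k)) for k in range(20)]   # an initial segment of a text for L_f
g = learner(text[:10])                  # hypothesis at time n = 10

# g_n agrees with f exactly on the arguments presented so far ...
assert all(g(k) == f(k) for k in range(10))
# ... and defaults to 0 beyond them, so the hypothesis keeps changing
# (pointwise convergence to f) as further pairs are absorbed.
assert g(15) == 0 and f(15) == 225
```

Each new pair forces a new index, which is exactly why this learner achieves semantic but not syntactic convergence.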

\(\mathcal{L}\) is not \(\mathbf{Ex}\)-identifiable. Suppose for contradiction that a learner \(M\) \(\mathbf{Ex}\)-identifies \(\mathcal{L}\). We diagonalize against \(M\). Define a total recursive function \(f\) as follows.

Begin presenting pairs \(\langle 0, 0\rangle , \langle 1, 0\rangle , \langle 2, 0\rangle , \ldots \) (corresponding to the zero function). Wait until \(M\) converges to some index \(e\): since the zero function is total recursive, \(L_{\mathbf{0}} \in \mathcal{L}\), so \(M\) must converge on this text, and then \(W_e = L_{\mathbf{0}}\) (the language of the zero function).

Modify \(f\): set \(f(k) = 1\) for some large \(k\) not yet presented. This changes the target to a different total recursive function, while the text presented so far remains consistent with both targets. The index \(e\) to which \(M\) has converged computes \(L_{\mathbf{0}} \neq L_f\), so after seeing \(\langle k, 1\rangle \) the learner must eventually abandon \(e\). Repeating the construction each time \(M\) settles on an index forces infinitely many mind changes; a careful effectivization of this argument (choosing each \(k\) computably from \(M\)'s behavior) yields a total recursive \(f\), hence \(L_f \in \mathcal{L}\), on which \(M\) never converges. This contradicts \(M\) \(\mathbf{Ex}\)-identifying all of \(\mathcal{L}\).
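One round of the diagonalization can be replayed as a toy sketch (Python; the "learner" here is a hypothetical stand-in with a fixed convergence behavior, not a real \(\mathbf{Ex}\)-learner): present pairs \(\langle n, 0\rangle \) until the guessed index stabilizes, then flip the target at a fresh argument.

```python
def mock_learner(prefix):
    """Hypothetical learner: outputs index 'e0' once it has seen at
    least 5 pairs, and a changing guess before that."""
    return "e0" if len(prefix) >= 5 else f"guess{len(prefix)}"

prefix, n = [], 0
while True:                      # present the zero function <n, 0>
    prefix.append((n, 0))
    n += 1
    # Detect (apparent) convergence: same index on two successive prefixes.
    if n >= 2 and mock_learner(prefix) == mock_learner(prefix[:-1]):
        break

k = n + 1                        # a fresh argument not yet presented
target_change = (k, 1)           # redefine f(k) = 1: the text so far is
                                 # consistent with both the old and new target
print(mock_learner(prefix), target_change)
```

The committed index still describes the zero function, yet the presented prefix is also a prefix of a text for the flipped target; iterating this flip is what drives the contradiction.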

The essential point is that \(\mathbf{BC}\)-identification does not require the index to stabilize, only the extension. The class \(\mathcal{L}\) exploits this: the learner must continually update its program as new pairs arrive, and the programs keep changing, but they all eventually compute the same language. \(\mathbf{Ex}\)-identification cannot tolerate this perpetual updating of indices.

\begin{tikzpicture} [
    criterion/.style={draw, thick, rounded corners, minimum width=2.4cm, minimum height=0.9cm, font=\small\bfseries, align=center},
    branch/.style={draw, rounded corners, minimum width=2cm, minimum height=0.7cm, font=\small, align=center},
    strict/.style={-{Stealth[length=2mm]}, thick, sepgreen!70!black},
    weak/.style={-{Stealth[length=2mm]}, thin, dashed, notegray},
    witlabel/.style={font=\tiny, text=edgeorange, fill=white, inner sep=1.5pt, align=center},
    node distance=1.2cm
]

% Main chain
\node[criterion, fill=defblue!15] (bc) {$\BC$};
\node[criterion, fill=defblue!15, below=2cm of bc] (ex) {$\Ex$};
\node[criterion, fill=defblue!15, below=2cm of ex] (fin) {$\FIN$};

% Strict inclusions
\draw[strict] (fin) -- node[witlabel, right, xshift=3pt] {all finite langs\\are $\Ex$ not $\FIN$} (ex);
\draw[strict] (ex) -- node[witlabel, right, xshift=3pt] {Case--Smith 1983\\(Thm~\ref{thm:bc-strictly-stronger})} (bc);

% Branches off Ex
\node[branch, fill=edgeorange!8, right=2.5cm of ex] (anom) {Anomalous\\$\Ex^*$};
\node[branch, fill=edgeorange!8, left=2.5cm of ex] (mono) {Monotonic\\$\mathrm{Mon}\Ex$};
\node[branch, fill=goldcolor!10, below right=1cm and 2.5cm of ex] (vac) {Vacillatory\\$\mathrm{Vex}$};
\node[branch, fill=goldcolor!10, above right=0.5cm and 2.5cm of bc] (trial) {Trial \&\\Error};

% Relations
\draw[strict] (ex) -- node[witlabel, above] {\rel{extends}} (anom);
\draw[weak] (mono) -- node[witlabel, above] {\rel{restricts}} (ex);
\draw[strict] (ex) -- node[witlabel, below right, xshift=-8pt] {\rel{extends}} (vac);
\draw[weak] (bc) -- node[witlabel, above right, xshift=-5pt] {$\Delta^0_2$} (trial);

\end{tikzpicture}
Figure 5.1 The identification hierarchy. Solid arrows indicate strict inclusion of the collections of identifiable classes (the target criterion identifies strictly more classes): \(\mathbf{FIN}\subsetneq \mathbf{Ex}\subsetneq \mathbf{BC}\), with a separating witness on each edge. Anomalous learning (\(\mathbf{Ex}^*\)) extends \(\mathbf{Ex}\) by allowing finitely many errors (anomalies) in the final hypothesis. Monotonic learning restricts \(\mathbf{Ex}\) by forbidding hypothesis retraction. Vacillatory learning sits between \(\mathbf{Ex}\) and \(\mathbf{BC}\).

Path: \(\textsf{\small fin\_learning} \xrightarrow {\texttt{\small strictly\_stronger}} \textsf{\small ex\_learning} \xrightarrow {\texttt{\small strictly\_stronger}} \textsf{\small bc\_learning}\).
Each arrow is witnessed by an explicit separation. The graph encodes these as \texttt{strictly\_stronger} edges with the separation witness as metadata. The hierarchy is not a sequence of increasingly liberal definitions; it is a chain of theorems, each proved by constructing a class that one criterion can learn and the other cannot.
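One possible encoding of these witnessed edges is sketched below (Python; the field names and the reachability helper are illustrative assumptions, not the kernel's actual schema).

```python
# Edges of the identification hierarchy, each carrying its separation
# witness as metadata (illustrative schema).
hierarchy = [
    {"from": "fin_learning", "to": "ex_learning",
     "relation": "strictly_stronger",
     "witness": "the class of all finite languages is Ex- but not FIN-identifiable"},
    {"from": "ex_learning", "to": "bc_learning",
     "relation": "strictly_stronger",
     "witness": "Case-Smith 1983 separation (Theorem 5.8)"},
]

def reachable(graph, src, dst):
    """Is dst reachable from src along strictly_stronger edges?"""
    frontier = {src}
    while frontier:
        nxt = {e["to"] for e in graph
               if e["from"] in frontier and e["relation"] == "strictly_stronger"}
        if dst in nxt:
            return True
        if nxt <= frontier:      # no new criteria discovered
            return False
        frontier |= nxt
    return False

print(reachable(hierarchy, "fin_learning", "bc_learning"))  # True
```

Because each edge records its witness, the transitive chain \(\mathbf{FIN}\subsetneq \mathbf{Ex}\subsetneq \mathbf{BC}\) can be recovered from the graph together with the theorems that justify it.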