Schema-learning and rebinding as mechanisms of in-context learning and emergence
Sivaramakrishnan Swaminathan Antoine Dedieu Rajkumar Vasudeva Raju Murray Shanahan Miguel Lázaro-Gredilla Dileep George
In-context learning (ICL) behavior
Prompting Bard: the model has likely never been trained on this particular sequence, but it manages to recall and use the abstraction of reversing a list.
Clone-structured Causal Graphs (CSCGs) as an interpretable sequence model
A CSCG uses latent states (aka “clones”) to disambiguate different contexts for the
same token, and then learns transitions between these latent states. It is essentially
a Hidden Markov Model with a deterministic emission matrix.
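As a rough illustration of this structure, the sketch below builds a toy clone HMM in numpy: each token owns a fixed block of clone states (deterministic emission), and only the clone-to-clone transition matrix carries information. The clone counts, random transitions, and forward-pass details are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal CSCG-style sketch: each token owns a fixed block of "clone" states,
# so the emission matrix is deterministic; only the clone-to-clone transition
# matrix would be learned (e.g., by EM) in practice.

n_tokens = 4          # vocabulary size
n_clones = 3          # clones allocated per token
n_states = n_tokens * n_clones

def clones_of(token):
    """Indices of the latent states (clones) that deterministically emit `token`."""
    return np.arange(token * n_clones, (token + 1) * n_clones)

# Transition matrix over clone states (random here; learned in the real model).
rng = np.random.default_rng(0)
T = rng.random((n_states, n_states))
T /= T.sum(axis=1, keepdims=True)

def sequence_log_likelihood(tokens, T):
    """Forward pass of the clone HMM: with deterministic emissions, the forward
    message only needs to track the clones of the currently observed token."""
    alpha = np.full(n_clones, 1.0 / n_clones)   # uniform start over the first token's clones
    logp = 0.0
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        block = T[np.ix_(clones_of(prev), clones_of(nxt))]
        alpha = alpha @ block
        norm = alpha.sum()                      # P(next token | history)
        logp += np.log(norm)
        alpha /= norm
    return logp

print(sequence_log_likelihood([0, 1, 2, 3, 1], T))
```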
From sequences to abstractions, with rebinding
How can a CSCG trained on a list-reversal example be applied to a novel prompt? By preserving the pattern of flow among the latent states as a "schema" and rebinding surprising prompt tokens to "slots" (clone groups) in the schema.
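The sketch below gives a highly schematic picture of rebinding, under the simplifying assumption that a schema can be flattened into a sequence of roles: anchored literal tokens plus named slots. Keeping that role structure fixed while reassigning which surface token each slot emits is enough to complete a novel prompt; the role-sequence representation stands in for the clone-level schema and is not the paper's implementation.

```python
# Schematic rebinding sketch: the "schema" is the preserved pattern of flow,
# flattened here into a role sequence; 'slot_i' marks a clone group whose
# surface token can be rebound, while literal tokens (the brackets) stay anchored.

schema = ['[', 'slot0', 'slot1', 'slot2', ']',
          '[', 'slot2', 'slot1', 'slot0', ']']   # list-reversal template for 3 elements

def rebind_and_complete(prompt, schema):
    """Bind surprising prompt tokens to slots, then emit the rest of the schema."""
    binding = {}
    for role, token in zip(schema, prompt):
        if role.startswith('slot'):
            binding[role] = token                # rebinding: slot -> novel surface token
        elif role != token:
            raise ValueError(f"anchor mismatch: expected {role!r}, got {token!r}")
    return [binding.get(role, role) for role in schema[len(prompt):]]

# Novel prompt the schema was never trained on; only the slot bindings change.
print(rebind_and_complete(['[', 'XY', 'KL', 'MR', ']', '['], schema))
# -> ['MR', 'KL', 'XY', ']']
```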
Surprise-driven EM to select among abstractions
When there are multiple available abstractions, there is a chicken-and-egg problem
between schema retrieval and slot rebinding.
The process can be bootstrapped with prediction surprise on the prompt: unsurprising tokens (in cyan) act as "anchors" that first restrict the search to relevant schemas. Surprising tokens (in magenta) then rebind to slots in those schemas, which finally selects the correct schema and completes the prompt.
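A toy version of this selection loop, continuing the role-sequence simplification above: each candidate schema is scored by how many prompt tokens it explains, either as anchor matches or as consistent slot rebindings, and the best-scoring schema completes the prompt. This counting rule is an illustrative stand-in for the EM-based likelihood computation.

```python
# Toy surprise-driven selection among schemas: anchors restrict the candidates,
# surprising tokens are rebound to slots, and the schema whose rebinding best
# explains the whole prompt is used to complete it.

schemas = {
    'reverse':      ['[', 'slot0', 'slot1', 'slot2', ']',
                     '[', 'slot2', 'slot1', 'slot0', ']'],
    'repeat_twice': ['[', 'slot0', 'slot1', 'slot2', ']',
                     '[', 'slot0', 'slot0', 'slot1', 'slot1', 'slot2', 'slot2', ']'],
}

def score_and_bind(schema, prompt):
    """Count prompt tokens the schema accounts for, either as an anchor match
    (unsurprising token) or as a consistent slot rebinding (surprising token)."""
    binding, explained = {}, 0
    for role, token in zip(schema, prompt):
        if role.startswith('slot'):
            if binding.setdefault(role, token) == token:
                explained += 1                   # consistent rebinding
        elif role == token:
            explained += 1                       # anchor match
    return explained, binding

def complete(prompt, schemas):
    name, schema = max(schemas.items(), key=lambda kv: score_and_bind(kv[1], prompt)[0])
    _, binding = score_and_bind(schema, prompt)
    return name, [binding.get(r, r) for r in schema[len(prompt):]]

print(complete(['[', 'QM', 'AY', 'JQ', ']', '[', 'QM'], schemas))
# -> ('repeat_twice', ['QM', 'AY', 'AY', 'JQ', 'JQ', ']'])
```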
Mechanistic model for in-context learning
1. Learning schemas (template circuits) during training.
2. Retrieving schemas in a context-sensitive manner.
3. Rebinding surprising prompt tokens to appropriate slots.
The first happens during (pre)training, and the latter two happen in
tandem at test-time, driven by a prompt.
We reason by analogy that the same framework applies to other
sequence models such as transformers and RNNs.
Subsumes prior work on Bayesian ICL
GINC dataset, from “An Explanation of In-context Learning as Implicit Bayesian
Inference” (ICLR 2022)
Probing algorithmic behavior with a synthetic dataset
Figure: (A) Language-instructed algorithm learning tasks. (B) Example learned circuit, shown in clone-stacked and unrolled views.
Algorithms
- List operations: repeat twice, reverse, print alternate even/odd, circular shift forward/backward, return nth element, return element at index.
- Matrix operations: roll columns 1 step, transpose, diagonal.

Training set format and examples
Format (five variations per algorithm): algo_k language description / in1 algo_k(in1) / ... / inM algo_k(inM) /
Examples:
- reverse the list / [ PZ LM RT ] [ RT LM PZ ] / [ QR FC JJ ] [ JJ FC QR ] / [ 2 r G J 7 ] [ 7 J G r 2 ] / [ a b c d ] [ d c b a ]
- flip the list / [ QM AY JQ HH ] [ HH JQ AY QM ] /
- [ a b 1 d m ] a / [ X a 2 3 ] X (return-element examples; instruction text not recovered)
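A small generator can make this format concrete. In the sketch below, the token vocabulary, list lengths, instruction wordings, and number of exemplars per sequence are assumptions for illustration; only the overall "description / input output / ..." layout follows the table above.

```python
import random

# Sketch of a generator for training sequences in the format
# "algo_k language description / in1 algo_k(in1) / ... / inM algo_k(inM) /".

ALGOS = {
    'reverse the list':      lambda xs: xs[::-1],
    'repeat the list twice': lambda xs: xs + xs,
}

def random_list(rng, vocab, min_len=3, max_len=5):
    return [rng.choice(vocab) for _ in range(rng.randint(min_len, max_len))]

def training_sequence(rng, description, algo, n_examples=3,
                      vocab=('PZ', 'LM', 'RT', 'QR', 'FC', 'JJ', 'a', 'b', '7')):
    parts = [description]
    for _ in range(n_examples):
        xs = random_list(rng, vocab)
        parts.append('[ ' + ' '.join(xs) + ' ] [ ' + ' '.join(algo(xs)) + ' ]')
    return ' / '.join(parts) + ' /'

rng = random.Random(0)
for description, algo in ALGOS.items():
    print(training_sequence(rng, description, algo))
```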
Test set 1: instruction-based retrieval
Prompt format: language instruction / novel input, to be completed by the model.
Example: prompt "reverse the list / [ XY KL MR ]", expected completion "[ MR KL XY ]".

Test set 2: example-based retrieval
Prompt format: in1 algo_k(in1) / in2, to be completed with algo_k(in2); the algorithm must be inferred from the worked example rather than from an instruction.
Figure: In-context accuracy by task as a function of the overallocation ratio (0.1 to 3.0), for instruction-based prompts and example-based prompts.
A proposal for retrieval and rebinding in transformers
Figure: Learned templates as content-based, position-based, and content-and-position-based predictors over a template sequence (e.g., [ A B C D ]). Each of the N templates is evaluated at different offsets along the prompt, and gating performs the selection among templates.
Learned templates in a transformer could involve content, position, or a mix of both.
Activations in the forward pass of a transformer could select among pre-learned
templates that mix content and position to achieve ICL without weight changes.
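One way to make this proposal concrete is the toy sketch below: a fixed bank of templates (a content-based, induction-style copier and position-based copiers at assumed offsets) is evaluated retrodictively on the prompt, and a gating over their accuracies picks which template predicts the next token, with no weight updates. The specific templates and the hard gating rule are illustrative assumptions, not a description of the transformer mechanism itself.

```python
import numpy as np

# Toy template bank with activation-driven gating; no weights change at test time.

def content_template(tokens, t):
    """Induction-style: if tokens[t] occurred before, predict the token that followed it."""
    for i in range(t - 1, -1, -1):
        if tokens[i] == tokens[t]:
            return tokens[i + 1]
    return None

def make_position_template(offset):
    """Predict that position t+1 repeats the token `offset` positions before it."""
    def predict(tokens, t):
        return tokens[t + 1 - offset] if t + 1 - offset >= 0 else None
    return predict

def gated_next_token(tokens, templates):
    # Gating signal: retrodictive accuracy of each template on the prompt so far.
    scores = []
    for tpl in templates:
        hits = [tpl(tokens, t) == tokens[t + 1] for t in range(len(tokens) - 1)]
        scores.append(np.mean(hits) if hits else 0.0)
    gate = np.argmax(scores)                      # hard selection for clarity
    return templates[gate](tokens, len(tokens) - 1), scores

templates = [content_template, make_position_template(4), make_position_template(5)]
prompt = ['[', 'A', 'B', 'C', ']', '[', 'A', 'B']
print(gated_next_token(prompt, templates))        # -> ('C', [per-template scores])
```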
[email protected] arXiv:2307.01201