
Implicit Representations of Meaning in Neural Language Models



Presentation video hosted on YouTube (with permission from the presenter): https://youtu.be/BHQBkN4PyPc

Neural language models, which place probability distributions over sequences of words, produce vector representations of words and sentences that are useful for language processing tasks as diverse as machine translation, question answering, and image captioning. These models’ usefulness is partially explained by the fact that their representations robustly encode lexical and syntactic information. But the extent to which language model training also induces representations of meaning remains a topic of ongoing debate. I will describe recent work showing that language models—trained on text alone, without any kind of grounded supervision—build structured meaning representations that are used to simulate entities and situations as they evolve over the course of a discourse. These representations can be linearly decoded into logical representations of world state (e.g. discourse representation structures). They can also be directly manipulated to produce predictable changes in generated output. Together, these results suggest that (some) highly structured aspects of meaning can be recovered by relatively unstructured models trained on corpus data.


wing.nus

July 01, 2021

Transcript

  1. Jacob Andreas, MIT CSAIL, LINGO.csail.mit.edu. Implicit Representations of Meaning in Neural Language Models
  2. Belinda Li, Max Nye. Implicit Representations of Meaning in Neural Language Models [ACL 2021]
  3. Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension
  4. Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will get a top.” “I will get Jack a top,” said Janet. [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension
  5. (Same passage as slide 4.) Language comprehension
  6. (Same passage as slide 4.) Language comprehension???
  7. Neural sequence models: John has a book. Mary has an apple. He gave her his
  8. Modeling the world described by language: Janet went to the store to get Jack a top.
  9. Modeling the world described by language. Janet went to the store to get Jack a top. [Diagram: Janet, store, Jack, get, top]
  10. Modeling the world described by language. Janet went to the store to get Jack a top. [Diagram: Janet, store, Jack, get, top, linked by the roles loc, agent, beneficiary, theme]
  11. Modeling the world described by language. Janet went to the store to get Jack a top. [Same diagram; which attributes should be tracked? purple? possesses?]
  12. Modeling the world described by language. Janet went to the store to get Jack a top. [Same diagram as slide 10]
  13. Modeling the world described by language. Janet went to the store to get Jack a top. But Jack already has a colorful top. [Diagram updated with the labels colorful and possesses on top]
  14. Modeling the world described by language. Janet went to the store to get Jack a top. She gave it to him. [Diagram: Janet, store, Jack, top; possesses]
  15. Dynamic Semantics. [Diagram: Janet, Penny, Jack, top; possesses] [Heim 1983, “File Change Semantics”; Kamp 1981, “Discourse Representation Theory”; Groenendijk & Stokhof 1991, “Dynamic Predicate Logic”]
  16. World models & language models. Janet and Penny went to the store to get Jack a top. But Jack already has a colorful top. [Diagram: world state (Janet, Penny, Jack, top; possesses) → p(“Jack will get a top” | …)]
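To make slide 16 concrete, here is a minimal sketch (my illustration, not the talk's setup) of how a pretrained causal language model assigns a probability to a candidate next sentence given the discourse. The GPT-2 checkpoint and the helper name are assumptions; the work in the talk uses encoder-decoder models fine-tuned on other data.

```python
# Sketch: scoring a candidate continuation under a pretrained causal LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log p(token | prefix) over the continuation tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Token t is predicted from position t-1; score only continuation tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_ctx = ctx_ids.shape[1]
    return token_lp[:, n_ctx - 1 :].sum().item()

context = ("Janet and Penny went to the store to get Jack a top. "
           "But Jack already has a colorful top.")
# Start the continuation with a space so the BPE boundary stays clean.
print(continuation_logprob(context, " Jack will get a top."))
```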
  17. Representations in language models. Janet and Penny went to the store to get Jack a top. But Jack already has a colorful top. [Diagram: vector representations → p(“Jack will get a top” | …)]
  18. Implicit representations of semantic state. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. [Diagram: LM encoder, LM decoder]
  19. (Same passage.) [Diagram adds the state: chest open, you possess the key, door locked]
  20. (Same passage.) [Diagram adds a semantic probe attached to the LM encoder]
  21. (Same passage.) [Diagram: the semantic probe decodes “door locked” = T from the encoder representation]
  22. (Same as slide 21.)
  23. (Same as slide 21.)
  24. (Same as slide 21.)
  25. Building the probe. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. [Diagram: LM encoder, LM decoder; a second LM encoder acts as a proposition encoder for “the door is locked”]
  26. Building the probe. (Same setup, without the final sentence.) [Diagram adds a proposition localizer.] Decode facts about an entity from the encoding of its first mention.
  27. Building the probe. [Diagram: proposition encoder, LM encoder, proposition localizer, classifier W.] Train a linear model to predict the truth value of each proposition: “the door is locked” = T.
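Slides 25-27 describe the probe pipeline: encode the proposition text with the same LM encoder, localize the entity's first mention in the context encoding, and train a linear model to predict the proposition's truth value. The PyTorch sketch below is schematic; the bilinear form of the classifier and all helper names are my assumptions, and the exact probe architecture in Li et al. (ACL 2021) may differ in detail.

```python
# Schematic probe: linear/bilinear truth-value classifier over
# (entity first-mention encoding, proposition encoding) pairs.
import torch
import torch.nn as nn

class PropositionProbe(nn.Module):
    """Predict whether a proposition (e.g. "the door is locked") is true,
    given the LM encoding of the entity's first mention and an encoding
    of the proposition text."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # A single bilinear score, trained with a logistic loss.
        self.W = nn.Bilinear(hidden_size, hidden_size, 1)

    def forward(self, entity_vec: torch.Tensor, prop_vec: torch.Tensor):
        return self.W(entity_vec, prop_vec).squeeze(-1)  # logit for "true"

def first_mention_vec(encoder_states, mention_spans, entity):
    """Proposition localizer (sketch): pool the frozen LM encoder states
    over the span where the entity is first mentioned in the context."""
    start, end = mention_spans[entity][0]
    return encoder_states[start:end].mean(dim=0)

probe = PropositionProbe(hidden_size=768)
loss_fn = nn.BCEWithLogitsLoss()  # target 1.0 if the proposition holds
```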
  28. Evaluation domains: Alchemy and TextWorld. TextWorld example: You are navigating through a house. You've just entered a serious study. There is a gross looking mantle in the room. It has nothing on it. You see a closed rusty toolbox. Now why would someone leave that there? Looks like there is a locked door. Find the key to unlock the door. You should try going east.
  29. Evaluation: does it work? What fraction of entities are exactly reconstructed? [Bar chart, 0-100%: Alchemy and TextWorld results for BART, T5, and the “no change” and “no LM” baselines]
  30. (Same chart as slide 29.)
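The metric on slides 29-30 is exact reconstruction at the entity level. Under my reading (an entity counts as correct only if every proposition the probe decodes about it matches the gold state), the computation looks roughly like this; the function and variable names are mine.

```python
# Sketch of the entity-level exact-match metric.
def entity_exact_match(predicted_state: dict, gold_state: dict) -> float:
    """predicted_state / gold_state: entity -> set of true propositions."""
    entities = gold_state.keys()
    correct = sum(predicted_state.get(e, set()) == gold_state[e] for e in entities)
    return correct / len(entities) if entities else 0.0

gold = {"door": {"locked"}, "chest": {"open"}, "key": {"in chest"}}
pred = {"door": {"locked"}, "chest": {"open"}, "key": set()}
print(entity_exact_match(pred, gold))  # 2/3 of entities exactly reconstructed
```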
  31. Evaluation: what kind of training matters? [Bar chart, 0-100: Alchemy and TextWorld results for T5, T5 without fine-tuning, random init, and random init without fine-tuning]
  32. Evaluation: locality. [Bar chart, 0-100: T5 on TextWorld, probing all mentions, the first mention, or the last mention]
  33. Evaluation: locality. [Heatmap over the tokens of the template “the [pos] beaker has [amount] [color], the [pos]+1 [amount] [color] …”, shown for the probe and the localizer (T5). Example: “the third beaker has 4 blue, the fourth 2 red … Drain 2 from beaker 3.” Probing has-2-blue(beaker3): 58.5% / 64.8% accuracy.]
  34. Language models as world models. There’s a locked wooden door leading east […] you open the door. [Diagram: the LM encoder maps the text to locked(.), leads(., east), and then ¬locked(.)]
  35. Language models as file cards [Heim 1983!]

  36. Building states from scratch. [Figure: intervention setup. C1 encodes “The first beaker has 2 green, the second beaker has 2 red, the third beaker has 1 green. Drain 2 from first beaker.” C2 encodes the same declarations followed by “Drain 2 from second beaker.” Cmix takes the encoded tokens from C1 and replaces the encodings of the second beaker’s initial-state declaration with the corresponding encodings from C2.]
     Table 2, intervention results (% of generations consistent with each context):
       Encoding | Context 1 | Context 2
       C1       |   96.2    |   21.6
       Cmix     |   86.7    |   64.8
       C2       |   24.1    |   87.7
     Though imperfect, Cmix is consistent with Context 1 much more often than C2 is, and with Context 2 much more often than C1 is, indicating that its underlying information state (approximately) believes both beakers to be empty.
  37. Building states from scratch. [Figure 5: candidate next instructions g1 “Mix the first beaker.”, g2 “Mix the second beaker.”, g3 “Mix the third beaker.” are decoded against the Cmix information state; two are marked inconsistent and one consistent.]
  38. Building states from scratch. (Same figure.) Additional example: empty the third beaker
  39. Building states from scratch. (Same figure.) Additional examples: empty the third beaker; stir the red beaker
  40. Building states from scratch. [Bar chart, 0-80: % of generations consistent with the combined context when conditioning on C1, C2, or Cmix.]
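Slides 36-40 intervene directly on encoder states: encode C1 and C2, splice the second beaker's declaration encodings from C2 into C1, and decode from the mixed representation. The sketch below illustrates the splice with an off-the-shelf BART checkpoint. Assumptions: the paper fine-tunes its models on Alchemy, so a stock facebook/bart-base will not produce sensible instructions; span locations are approximated by re-tokenizing prefixes; and generate() is assumed to accept precomputed encoder_outputs, as recent transformers versions do.

```python
# Sketch: splice encoder states from two contexts, then decode from the mix.
import torch
from transformers import BartForConditionalGeneration, BartTokenizerFast
from transformers.modeling_outputs import BaseModelOutput

tok = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

decl = ("The first beaker has 2 green, the second beaker has 2 red, "
        "the third beaker has 1 green.")
c1 = decl + " Drain 2 from first beaker."
c2 = decl + " Drain 2 from second beaker."

ids1 = tok(c1, return_tensors="pt")
ids2 = tok(c2, return_tensors="pt")
with torch.no_grad():
    enc1 = model.get_encoder()(**ids1).last_hidden_state
    enc2 = model.get_encoder()(**ids2).last_hidden_state

# Locate the tokens declaring the second beaker's initial state (approximate).
prefix = "The first beaker has 2 green,"
span_start = len(tok(prefix, add_special_tokens=False).input_ids) + 1  # +1 for <s>
span_end = span_start + len(
    tok(" the second beaker has 2 red,", add_special_tokens=False).input_ids)

# Cmix: C1's encodings, with the second beaker's declaration replaced by the
# corresponding (contextually "drained") encodings from C2.
enc_mix = enc1.clone()
enc_mix[:, span_start:span_end] = enc2[:, span_start:span_end]

out = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=enc_mix),
    attention_mask=ids1.attention_mask,
    max_length=20,
)
print(tok.decode(out[0], skip_special_tokens=True))
```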
  41. What’s still missing. Attribution of model errors: p(probe is accurate) vs. p(generation is semantically acceptable).
  42. Does grounded training improve accuracy? (Same chest-and-key passage, LM encoder/decoder, and semantic probe as before.) [Plot: probe accuracy vs. number of training examples, 0 to 10000, comparing “predict text+state” with “predict text”.]
  43. Would ground-truth states improve accuracy? You unlock the door. [Diagram: LM encoder, LM decoder; state: chest open, you possess the key, door locked.] [Plot: accuracy vs. % of training examples with state labels, 0% to 100%, comparing “predict text+state” with and without using the state when predicting.]
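Slides 42-43 ask whether adding state supervision during training helps the probe. One plausible reading, sketched below with T5, is a multi-task objective in which the same encoder is trained both to predict the next sentence and, for examples that have state annotations, to predict a serialized state string. The state format, the equal loss weighting, and the checkpoint are my assumptions, not the paper's exact setup.

```python
# Sketch: "predict text + state" as two seq2seq losses sharing one encoder.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

context = ("You see an open chest. The only thing in the chest is an old key. "
           "There is a locked wooden door leading east. You pick up the key.")
next_text = "You unlock the door."
state = "open(chest); possesses(you, key); locked(door)"  # illustrative format

enc = tok(context, return_tensors="pt")
text_labels = tok(next_text, return_tensors="pt").input_ids
state_labels = tok(state, return_tensors="pt").input_ids

# Predict the next sentence; when a state label exists, also predict the state.
loss_text = model(**enc, labels=text_labels).loss
loss_state = model(**enc, labels=state_labels).loss
loss = loss_text + loss_state
loss.backward()
```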
  44. What’s still missing. Quantification: There are twenty-three reindeer; most of them have red noses. Implication and counterfactuals: If Pat goes to the party, so will Jan. If Pat had gone to the last one, Mo would have gone too. Pat will go to the party this time.
  45. Summary. Language models produce (rudimentary) representations of world states, and these states can be manipulated with predictable effects on model output. But they are far from 100% reliable, and there are lots of open questions about what these representations capture and how to improve them.
  46. RESEARCHERS SPONSORS Belinda Li Max Nye Thank you!

  47. Summary (repeated from slide 45).