Slide 1

Slide 1 text

Jacob Andreas MIT CSAIL LINGO.csail.mit.edu Implicit Representations of Meaning in Neural Language Models

Slide 2

Slide 2 text

Belinda Li Max Nye Implicit Representations of Meaning in Neural Language Models [ACL 2021]

Slide 3

Slide 3 text

Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension

Slide 4

Slide 4 text

Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will get a top.” “I will get Jack a top,” said Janet. [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension

Slide 5

Slide 5 text

Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will get a top.” “I will get Jack a top,” said Janet. [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension

Slide 6

Slide 6 text

Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will get a top.” “I will get Jack a top,” said Janet. [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension???

Slide 7

Slide 7 text

Neural sequence models John has a book. Mary has an apple. He gave her his

Slide 8

Slide 8 text

Modeling the world described by language Janet went to the store to get Jack a top.

Slide 9

Slide 9 text

Janet went to the store to get Jack a top. Janet store Jack get top Modeling the world described by language

Slide 10

Slide 10 text

Janet store Jack get top loc agent beneficiary theme Modeling the world described by language Janet went to the store to get Jack a top.

Slide 11

Slide 11 text

Janet store Jack get top loc agent beneficiary theme purple? possesses? Modeling the world described by language Janet went to the store to get Jack a top.

Slide 12

Slide 12 text

Janet store Jack get top loc agent beneficiary theme Modeling the world described by language Janet went to the store to get Jack a top.

Slide 13

Slide 13 text

Janet store Jack get top loc agent beneficiary theme colorful possesses Janet went to the store to get Jack a top. But Jack already has a colorful top. top Modeling the world described by language

Slide 14

Slide 14 text

Janet store Jack top possesses She gave it to him. Modeling the world described by language Janet went to the store to get Jack a top.

Slide 15

Slide 15 text

Dynamic Semantics Janet Penny Jack top possesses [Heim 83, “File Change Semantics”; Kamp 81, “Discourse Representation Theory”; Groenendijk & Stokhof 91, “Dynamic Predicate Logic”]

Slide 16

Slide 16 text

World models & language models Janet Penny Jack top possesses p(“Jack will get a top” | …) Janet and Penny went to the store to get Jack a top. But Jack already has a colorful top.
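The conditional probability on this slide can be unpacked with the chain rule. As a minimal sketch, a hand-built bigram model estimated from the story text stands in for the neural language model (the model choice and smoothing-free estimation are illustrative assumptions, not anything from the paper):

```python
import math
from collections import Counter

# Toy stand-in for a neural LM: a bigram model estimated from the story text,
# used to score a continuation via the chain rule
#   p(continuation | context) = prod_t p(w_t | w_{t-1}).
corpus = ("janet and penny went to the store to get jack a top "
          "jack already has a colorful top jack will get a top").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

def continuation_logprob(context_last_word, continuation):
    prev, lp = context_last_word, 0.0
    for w in continuation.split():
        lp += math.log(bigram_prob(prev, w))
        prev = w
    return lp

# Score "jack will get a top" given a context ending in "top".
lp = continuation_logprob("top", "jack will get a top")
```

A real LM replaces the bigram table with a neural next-token distribution, but the chain-rule decomposition is the same.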

Slide 17

Slide 17 text

Representations in language models p(“Jack will get a top” | …) Janet and Penny went to the store to get Jack a top. But Jack already has a colorful top. vector representations

Slide 18

Slide 18 text

Implicit representations of semantic state You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder

Slide 19

Slide 19 text

You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder chest open possesses key you door locked Implicit representations of semantic state

Slide 20

Slide 20 text

You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder semantic probe LM decoder chest open possesses key you door locked Implicit representations of semantic state

Slide 21

Slide 21 text

You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder door locked T semantic probe Implicit representations of semantic state

Slide 22

Slide 22 text

You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder door locked T semantic probe Implicit representations of semantic state

Slide 23

Slide 23 text

You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder door locked T semantic probe Implicit representations of semantic state

Slide 24

Slide 24 text

You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder door locked T semantic probe Implicit representations of semantic state

Slide 25

Slide 25 text

Building the probe You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder the door is locked Proposition encoder LM encoder

Slide 26

Slide 26 text

Building the probe You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. LM encoder the door is locked Proposition encoder LM encoder Proposition localizer Decode facts about an entity from the encoding of its first mention.

Slide 27

Slide 27 text

Building the probe the door is locked Proposition encoder LM encoder You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. LM encoder Proposition localizer Classifier W Train a linear model to predict the truth value of each proposition. = T
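The classifier step can be sketched as a bilinear probe: score each (proposition encoding, localized mention encoding) pair with a learned matrix W and threshold at zero. The sketch below trains on random synthetic vectors standing in for LM encodings; the dimensions, data, and exact bilinear form are assumptions, not the paper's implementation:

```python
import numpy as np

# Sketch of the linear truth-value probe (dimensions and synthetic data are
# assumptions standing in for real LM encodings).
rng = np.random.default_rng(0)
d, n = 16, 200
mention_enc = rng.normal(size=(n, d))  # localized entity-mention encodings
prop_enc = rng.normal(size=(n, d))     # proposition encodings
# Synthetic labels: a proposition counts as "true" when the encodings align.
labels = (np.sum(mention_enc * prop_enc, axis=1) > 0).astype(float)

# Bilinear classifier: score = prop_enc @ W @ mention_enc, trained with
# full-batch gradient descent on the logistic loss.
W = np.zeros((d, d))
lr = 0.1
for _ in range(500):
    scores = np.einsum("nd,de,ne->n", prop_enc, W, mention_enc)
    probs = 1.0 / (1.0 + np.exp(-scores))
    grad = np.einsum("n,nd,ne->de", probs - labels, prop_enc, mention_enc) / n
    W -= lr * grad

preds = np.einsum("nd,de,ne->n", prop_enc, W, mention_enc) > 0
acc = float(np.mean(preds == labels))  # high on this synthetic training set
```

Keeping the probe linear is the point of the design: any decoding accuracy it achieves must come from structure already present in the encodings, not from the probe's own capacity.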

Slide 28

Slide 28 text

Evaluation Alchemy You are navigating through a house. You've just entered a serious study. There is a gross looking mantle in the room. It has nothing on it. You see a closed rusty toolbox. Now why would someone leave that there? Looks like there is a locked door. Find the key to unlock the door. You should try going east. TextWorld

Slide 29

Slide 29 text

Evaluation: does it work? What fraction of entities are exactly reconstructed? [bar chart, 0-100%: Alchemy and TextWorld; conditions: BART, T5, No change, No LM]
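The metric on this slide ("fraction of entities exactly reconstructed") can be sketched as a strict per-entity comparison; the entity names and proposition strings below are hypothetical stand-ins for the benchmark annotations:

```python
# Sketch of the exact-reconstruction metric: an entity counts only if the
# probe recovers its full set of true propositions, with no errors.
def exact_match_fraction(predicted, gold):
    """predicted, gold: dicts mapping entity -> set of true propositions."""
    hits = sum(1 for entity in gold if predicted.get(entity) == gold[entity])
    return hits / len(gold)

gold = {
    "door": {"locked(door)", "leads(door, east)"},
    "chest": {"open(chest)"},
    "key": {"in(key, chest)"},
}
predicted = {
    "door": {"locked(door)", "leads(door, east)"},  # exact match
    "chest": {"open(chest)"},                        # exact match
    "key": {"possesses(you, key)"},                  # wrong: not counted
}
frac = exact_match_fraction(predicted, gold)  # 2 of 3 entities match exactly
```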

Slide 30

Slide 30 text

Evaluation: does it work? What fraction of entities are exactly reconstructed? [bar chart, 0-100%: Alchemy and TextWorld; conditions: BART, T5, No change, No LM]

Slide 31

Slide 31 text

Evaluation: does it work? What kind of training matters? [bar chart, 0-100%: Alchemy and TextWorld; conditions: T5, T5 no fine-tuning, random init, random init no fine-tuning]

Slide 32

Slide 32 text

Evaluation: locality What kind of training matters? [bar chart, 0-100%: T5 (TextWorld), probed at all mentions / first mention / last mention]

Slide 33

Slide 33 text

Evaluation: locality [heatmap: per-token probe accuracy over the template "the [pos] beaker has [amount] [color], the [pos]+1 [amount] [color] …" (T5)] Example: "the third beaker has 4 blue, the fourth 2 red … Drain 2 from beaker 3." Probe with localizer decodes has-2-blue(beaker3) at 58.5% / 64.8% accuracy.

Slide 34

Slide 34 text

Language models as world models There’s a locked wooden door leading east […] you open the door. LM encoder locked(·) leads(·, east) ¬locked(·)
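The update on this slide (reading "you open the door" flips locked to ¬locked) is the dynamic-semantics view of text as a state-update operator. A toy sketch with one hand-written rule; the proposition strings and the rule itself are illustrative, whereas in the experiments the state is decoded from the LM encoding rather than maintained symbolically:

```python
# Toy dynamic-semantics update: a sentence acts as an operator on a set of
# propositions (the state). The single rule below is hand-written.
state = {"locked(door)", "leads(door, east)"}

def update(state, event):
    if event == "you open the door":
        # Opening the door retracts locked(door) and asserts its negation.
        return (state - {"locked(door)"}) | {"¬locked(door)"}
    return state

new_state = update(state, "you open the door")
```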

Slide 35

Slide 35 text

Language models as file cards [Heim 1983!]

Slide 36

Slide 36 text

Building states from scratch LM encoder over C1: "The first beaker has 2 green, the second beaker has 2 red, the third beaker has 1 green. Drain 2 from first beaker." LM encoder over C2: "The first beaker has 2 green, the second beaker has 2 red, the third beaker has 1 green. Drain 2 from second beaker." Spliced information state: Cmix.
% of generations consistent with… Context 1 / Context 2: C1 96.2 / 21.6; Cmix 86.7 / 64.8; C2 24.1 / 87.7.
Table 2: Intervention experiments. Though imperfect, Cmix is much more often consistent with Context 1 than C2 is, and with Context 2 than C1 is, indicating that its underlying information state (approximately) believes both beakers to be empty.
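The Cmix construction can be sketched as array splicing over encoder states: reuse C1's encoding, but overwrite the token span declaring the second beaker's state with C2's encoding of that span. The sequence length, span position, and random stand-in encodings below are all assumptions; the real experiment splices transformer encoder outputs:

```python
import numpy as np

# Sketch of the Cmix intervention: splice encoder states from two contexts.
seq_len, d = 12, 8
rng = np.random.default_rng(1)
enc_c1 = rng.normal(size=(seq_len, d))  # encoding of context C1
enc_c2 = rng.normal(size=(seq_len, d))  # encoding of context C2

# Token span assumed to declare the second beaker's initial state.
span = slice(4, 8)
enc_mix = enc_c1.copy()
enc_mix[span] = enc_c2[span]  # Cmix: C1's encoding with C2's span spliced in

# Outside the span Cmix matches C1; inside it matches C2.
assert np.allclose(enc_mix[:4], enc_c1[:4])
assert np.allclose(enc_mix[span], enc_c2[span])
```

Decoding from the spliced state then tests whether the representation is compositional: if it is, generations should treat both drained beakers as empty.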

Slide 37

Slide 37 text

Building states from scratch Figure 5: Intervention experiments. Construct C1, C2 by appending text to empty one of the beakers (in this case the first and second beakers) and encoding the result. Then create Cmix by taking encoded tokens from C1 and replacing the encodings corresponding to the second beaker's initial state declaration with those from C2. Decoding from Cmix: (g1) Mix the first beaker (inconsistent); (g2) Mix the second beaker (inconsistent); (g3) Mix the third beaker (consistent).

Slide 38

Slide 38 text

Building states from scratch [Figure 5 and Table 2, repeated from the previous slides] empty the third beaker

Slide 39

Slide 39 text

Building states from scratch [Figure 5 and Table 2, repeated from the previous slides] empty the third beaker; stir the red beaker

Slide 40

Slide 40 text

Building states from scratch [Figure 5 and Table 2, repeated from the previous slides] [bar chart, 0-80: % of generations consistent with the combined context, conditioned on C1, C2, Cmix]

Slide 41

Slide 41 text

What’s still missing Attribution of model errors: p(probe is accurate) vs. p(generation is semantically acceptable)

Slide 42

Slide 42 text

Does grounded training improve accuracy? You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder semantic probe … predict text+state predict text # of training examples 0 10000

Slide 43

Slide 43 text

Would ground-truth states improve accuracy? You unlock the door. LM encoder LM decoder chest open possesses key you door locked use state when predicting predict text+state predict text+state % of training examples with state labels 0% 100%

Slide 44

Slide 44 text

What’s still missing Quantification: There are twenty-three reindeer; most of them have red noses. Implication and counterfactuals: If Pat goes to the party, so will Jan. If Pat had gone to the last one, Mo would have gone too. Pat will go to the party this time.

Slide 45

Slide 45 text

Summary Language models produce (rudimentary) representations of world states, and these states can be manipulated with predictable effects on model output. But they are far from 100% reliable; lots of open questions about what these representations capture and how to improve them.

Slide 46

Slide 46 text

RESEARCHERS SPONSORS Belinda Li Max Nye Thank you!

Slide 47

Slide 47 text

Summary Language models produce (rudimentary) representations of world states, and these states can be manipulated with predictable effects on model output. But they are far from 100% reliable; lots of open questions about what these representations capture and how to improve them.