
Implicit Representations of Meaning in Neural Language Models

wing.nus
July 01, 2021


Presentation video (hosted on YouTube with permission from the presenter): https://youtu.be/BHQBkN4PyPc

Neural language models, which place probability distributions over sequences of words, produce vector representations of words and sentences that are useful for language processing tasks as diverse as machine translation, question answering, and image captioning. These models’ usefulness is partially explained by the fact that their representations robustly encode lexical and syntactic information. But the extent to which language model training also induces representations of meaning remains a topic of ongoing debate. I will describe recent work showing that language models—trained on text alone, without any kind of grounded supervision—build structured meaning representations that are used to simulate entities and situations as they evolve over the course of a discourse. These representations can be linearly decoded into logical representations of world state (e.g. discourse representation structures). They can also be directly manipulated to produce predictable changes in generated output. Together, these results suggest that (some) highly structured aspects of meaning can be recovered by relatively unstructured models trained on corpus data.



Transcript

  1. Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension
  2. Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will get a top.” “I will get Jack a top,” said Janet. [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension
  3. Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will get a top.” “I will get Jack a top,” said Janet. [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension
  4. Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will get a top.” “I will get Jack a top,” said Janet. [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension???
  5. Janet went to the store to get Jack a top. Janet store Jack get top Modeling the world described by language
  6. Janet store Jack get top loc agent beneficiary theme Modeling the world described by language Janet went to the store to get Jack a top.
  7. Janet store Jack get top loc agent beneficiary theme purple? possesses? Modeling the world described by language Janet went to the store to get Jack a top.
  8. Janet store Jack get top loc agent beneficiary theme Modeling the world described by language Janet went to the store to get Jack a top.
  9. Janet store Jack get top loc agent beneficiary theme colorful possesses Janet went to the store to get Jack a top. But Jack already has a colorful top. top Modeling the world described by language
  10. Janet store Jack top possesses She gave it to him. Modeling the world described by language Janet went to the store to get Jack a top.
  11. Dynamic Semantics Janet Penny Jack top possesses [Heim 83, “File Change Semantics”; Kamp 81, “Discourse Representation Theory”; Groenendijk & Stokhof 91, “Dynamic Predicate Logic”]
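As an aside, the dynamic-semantics picture on this slide can be read as a data structure: an information state is a set of discourse referents plus propositions about them, updated sentence by sentence. A minimal sketch of that idea in Python; the class name, role labels, and update calls are illustrative assumptions, not anything from the talk.

    # A toy "file change" style information state: discourse referents plus
    # propositions about them, updated as each sentence is processed.
    from dataclasses import dataclass, field

    @dataclass
    class InfoState:
        referents: set = field(default_factory=set)      # e.g. {"Janet", "Jack", "top"}
        propositions: set = field(default_factory=set)   # e.g. {("possesses", "Jack", "top")}

        def update(self, new_referents, added=(), removed=()):
            """Introduce referents, assert new facts, retract stale ones."""
            self.referents |= set(new_referents)
            self.propositions |= set(added)
            self.propositions -= set(removed)

    state = InfoState()
    # "Janet went to the store to get Jack a top."
    state.update({"Janet", "store", "Jack", "top"},
                 added={("loc", "Janet", "store"), ("agent", "get", "Janet"),
                        ("beneficiary", "get", "Jack"), ("theme", "get", "top")})
    # "She gave it to him."
    state.update(set(), added={("possesses", "Jack", "top")})
    print(state.referents, state.propositions)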
  12. World models & language models Janet Penny Jack top possesses p(“Jack will get a top” | …) Janet and Penny went to the store to get Jack a top. But Jack already has a colorful top.
  13. Representations in language models p(“Jack will get a top” | …) Janet and Penny went to the store to get Jack a top. But Jack already has a colorful top. vector representations
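Slides 12 and 13 contrast two things a language model gives us: a probability for a continuation and vector representations of the context. A minimal sketch of extracting both from a pretrained seq2seq model, assuming the Hugging Face transformers library and the facebook/bart-base checkpoint (the talk probes BART and T5; this particular snippet is only an illustration):

    # Sketch: pull vector representations and a continuation probability out of
    # a pretrained seq2seq LM (BART via Hugging Face transformers).
    import torch
    from transformers import BartTokenizer, BartForConditionalGeneration

    tok = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

    context = ("Janet and Penny went to the store to get Jack a top. "
               "But Jack already has a colorful top.")
    continuation = "Jack will get a top."

    enc = tok(context, return_tensors="pt")
    dec = tok(continuation, return_tensors="pt")

    with torch.no_grad():
        out = model(**enc, labels=dec["input_ids"])

    # Per-token vector representations of the context (last encoder layer).
    context_vectors = out.encoder_last_hidden_state        # (1, ctx_len, hidden)
    # Approximate total log-probability of the continuation given the context
    # (negative mean cross-entropy summed back over the target tokens).
    log_prob = -out.loss.item() * dec["input_ids"].shape[1]
    print(context_vectors.shape, log_prob)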
  14. Implicit representations of semantic state You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder
  15. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder chest open possesses key you door locked Implicit representations of semantic state
  16. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder semantic probe LM decoder chest open possesses key you door locked Implicit representations of semantic state
  17. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder door locked T semantic probe Implicit representations of semantic state
  18. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder door locked T semantic probe Implicit representations of semantic state
  19. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder door locked T semantic probe Implicit representations of semantic state
  20. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder door locked T semantic probe Implicit representations of semantic state
  21. Building the probe You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder the door is locked Proposition encoder LM encoder
  22. Building the probe You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. LM encoder the door is locked Proposition encoder LM encoder Proposition localizer Decode facts about an entity from the encoding of its first mention.
  23. Building the probe the door is locked Proposition encoder LM encoder You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. LM encoder Proposition localizer Classifier W Train a linear model to predict truth value of each proposition. = T
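A minimal sketch of the probe described on slides 21-23: the LM encoder's representation of an entity's first mention and a proposition encoder's representation of a candidate fact (e.g. "the door is locked") are fed to a linear classifier W that predicts the proposition's truth value. Variable names, the concatenation choice, and the training loop are assumptions, not the presenter's released code.

    # Sketch of the probing setup: a linear classifier W reads the truth value
    # of a proposition from (entity encoding, proposition encoding).
    import torch
    import torch.nn as nn

    hidden = 768   # LM encoder hidden size (e.g. bart-base); an assumption

    class PropositionProbe(nn.Module):
        def __init__(self, hidden):
            super().__init__()
            # Linear model over concatenated entity + proposition encodings.
            self.W = nn.Linear(2 * hidden, 1)

        def forward(self, entity_vec, prop_vec):
            return self.W(torch.cat([entity_vec, prop_vec], dim=-1)).squeeze(-1)

    probe = PropositionProbe(hidden)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    def train_step(entity_vecs, prop_vecs, truth_values):
        """entity_vecs: encodings at each entity's first mention (batch, hidden);
        prop_vecs: encodings of propositions such as "the door is locked";
        truth_values: 1.0 if the proposition holds in the current state, else 0.0."""
        logits = probe(entity_vecs, prop_vecs)
        loss = loss_fn(logits, truth_values)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Dummy batch, just to show the expected shapes.
    print(train_step(torch.randn(8, hidden), torch.randn(8, hidden),
                     torch.randint(0, 2, (8,)).float()))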
  24. Evaluation Alchemy You are navigating through a house. You've just entered a serious study. There is a gross looking mantle in the room. It has nothing on it. You see a closed rusty toolbox. Now why would someone leave that there? Looks like there is a locked door. Find the key to unlock the door. You should try going east. TextWorld
  25. Evaluation: does it work? What fraction of entities are exactly reconstructed? [bar chart over Alchemy and TextWorld: BART, T5, “No change” baseline, “No LM” baseline; y-axis 0 to 100]
  26. Evaluation: does it work? What fraction of entities are exactly reconstructed? [bar chart over Alchemy and TextWorld: BART, T5, “No change” baseline, “No LM” baseline; y-axis 0 to 100]
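The metric on slides 25 and 26 counts an entity as correct only if every fact about it is recovered. A small sketch of how such an exact-match fraction might be computed, assuming gold and predicted states are stored as sets of propositions per entity (the data format is an assumption for illustration):

    # Sketch of the "fraction of entities exactly reconstructed" metric:
    # an entity counts as correct only if its full set of facts is recovered.
    def exact_entity_match(gold_states, predicted_states):
        """Both arguments: dict mapping entity -> set of propositions."""
        entities = gold_states.keys()
        correct = sum(predicted_states.get(e, set()) == gold_states[e] for e in entities)
        return 100.0 * correct / len(entities)

    gold = {"door": {("locked", True)},
            "chest": {("open", True), ("contains", "key")}}
    pred = {"door": {("locked", True)},
            "chest": {("open", True)}}
    print(exact_entity_match(gold, pred))   # 50.0 -- only "door" is fully reconstructed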
  27. Evaluation: does it work? What kind of training matters? [bar chart over Alchemy and TextWorld: T5; T5, no fine-tuning; random init; random init, no fine-tuning; y-axis 0 to 100]
  28. Evaluation: locality [bar chart, T5 (TextWorld): all mentions vs. first mention vs. last mention; y-axis 0 to 100]
  29. Evaluation: locality [heatmap of probe / localizer accuracy by token position over the template “the [pos] beaker has [amount] [color] , the [pos]+1 [amount] [color] …”; example: “the third beaker has 4 blue, the fourth 2 red … Drain 2 from beaker 3.” decoded as has-2-blue(beaker3); 58.5% / 64.8% accuracy (T5)]
  30. Language models as world models There’s a locked wooden door leading east […] you open the door. LM encoder locked(.) leads(., east) ¬locked(.)
  31. Building states from scratch C1: “The first beaker has 2 green, the second beaker has 2 red, the third beaker has 1 green. Drain 2 from first beaker.” C2: “The first beaker has 2 green, the second beaker has 2 red, the third beaker has 1 green. Drain 2 from second beaker.” Cmix: a spliced encoding built from C1 and C2 (see slide 32). Intervention results (% of generations consistent with Context 1 / Context 2): C1 96.2 / 21.6; Cmix 86.7 / 64.8; C2 24.1 / 87.7. Though imperfect, Cmix is consistent with Context 1 much more often than C2 is, and with Context 2 much more often than C1 is, indicating that its underlying information state (approximately) believes both beakers to be empty.
  32. Building states from scratch Intervention experiments: construct C1 and C2 by appending text to empty one of the beakers (in this case the first and second beakers) and encoding the result; then create Cmix by taking the encoded tokens from C1 and replacing the encodings corresponding to the second beaker’s initial state declaration with those from C2. Candidate generations (g1) “Mix the first beaker.”, (g2) “Mix the second beaker.”, (g3) “Mix the third beaker.” are each marked consistent or inconsistent with the mixed state.
  33. Building states from scratch (same setup) Example generation: “empty the third beaker”
  34. Building states from scratch (same setup) Example generations: “empty the third beaker”, “stir the red beaker”
  35. Building states from scratch [bar chart: % of generations consistent with the combined context when conditioning on C1, C2, and Cmix; y-axis 0 to 80]
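A minimal sketch of the intervention in slides 31-35: encode both contexts, splice the token encodings for one beaker's state declaration from one context into the other, and decode from the mixed encoding. The model choice (facebook/bart-base), the token span, and the use of precomputed encoder_outputs with Hugging Face generate are assumptions made for illustration; the original experiments' details may differ.

    # Sketch of the Cmix intervention: splice encoder states from two contexts
    # and decode from the mixture. The span indices below are placeholders; a
    # real run would locate the second beaker's state declaration in each context.
    import torch
    from transformers import BartTokenizer, BartForConditionalGeneration
    from transformers.modeling_outputs import BaseModelOutput

    tok = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

    c1 = ("The first beaker has 2 green, the second beaker has 2 red, the third "
          "beaker has 1 green. Drain 2 from first beaker.")
    c2 = ("The first beaker has 2 green, the second beaker has 2 red, the third "
          "beaker has 1 green. Drain 2 from second beaker.")

    inputs1 = tok(c1, return_tensors="pt")
    inputs2 = tok(c2, return_tensors="pt")
    with torch.no_grad():
        h1 = model.get_encoder()(**inputs1).last_hidden_state
        h2 = model.get_encoder()(**inputs2).last_hidden_state

    # Replace the encodings of one state declaration in C1 with the
    # corresponding encodings from C2 (indices here are made up).
    span = slice(9, 17)
    h_mix = h1.clone()
    h_mix[:, span, :] = h2[:, span, :]

    # Generate from the spliced encoder output instead of re-encoding text.
    gen = model.generate(encoder_outputs=BaseModelOutput(last_hidden_state=h_mix),
                         attention_mask=inputs1["attention_mask"],
                         max_length=20)
    print(tok.decode(gen[0], skip_special_tokens=True))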
  36. Does grounded training improve accuracy? You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. LM encoder LM decoder semantic probe … [learning curves: probe accuracy vs. number of training examples (0 to 10000) for “predict text” and “predict text+state”]
  37. Would ground-truth states improve accuracy? You unlock the door. LM encoder LM decoder chest open possesses key you door locked [curves: probe accuracy vs. % of training examples with state labels (0% to 100%) for “predict text+state” and “use state when predicting”]
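Slides 36 and 37 compare plain text prediction against a "predict text+state" objective, in which ground-truth state is serialized into the generation target for some fraction of training examples. A minimal sketch of building such a target string; the [STATE] separator and the fact format are assumptions, not the format used in the talk.

    # Sketch of the "predict text + state" auxiliary objective: append a
    # serialized ground-truth state to the target text when a state label exists.
    def make_target(next_text, state=None):
        """state: iterable of (predicate, entity, value) facts, or None when the
        example has no state label (then the target is just the next text)."""
        if state is None:
            return next_text
        facts = "; ".join(f"{pred}({ent}) = {val}" for pred, ent, val in state)
        return f"{next_text} [STATE] {facts}"

    print(make_target("You unlock the door.",
                      [("locked", "door", "false"), ("open", "chest", "true")]))
    # -> "You unlock the door. [STATE] locked(door) = false; open(chest) = true"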
  38. What’s still missing Quantification: There are twenty-three reindeer; most of them have red noses. Implication and counterfactuals: If Pat goes to the party, so will Jan. If Pat had gone to the last one, Mo would have gone too. Pat will go to the party this time.
  39. Summary Language models produce (rudimentary) representations of world states, and these states can be manipulated with predictable effects on model output. But they are far from 100% reliable; there are lots of open questions about what these representations capture and how to improve them.
  40. Summary Language models produce (rudimentary) representations of world states, and these states can be manipulated with predictable effects on model output. But they are far from 100% reliable; there are lots of open questions about what these representations capture and how to improve them.