
Implicit Representations of Meaning in Neural Language Models



Presentation video hosted on YouTube (with permission from the presenter): https://youtu.be/BHQBkN4PyPc

Neural language models, which place probability distributions over sequences of words, produce vector representations of words and sentences that are useful for language processing tasks as diverse as machine translation, question answering, and image captioning. These models’ usefulness is partially explained by the fact that their representations robustly encode lexical and syntactic information. But the extent to which language model training also induces representations of meaning remains a topic of ongoing debate. I will describe recent work showing that language models—trained on text alone, without any kind of grounded supervision—build structured meaning representations that are used to simulate entities and situations as they evolve over the course of a discourse. These representations can be linearly decoded into logical representations of world state (e.g. discourse representation structures). They can also be directly manipulated to produce predictable changes in generated output. Together, these results suggest that (some) highly structured aspects of meaning can be recovered by relatively unstructured models trained on corpus data.


wing.nus

July 01, 2021

Transcript

  1. Jacob Andreas, MIT CSAIL, LINGO.csail.mit.edu. Implicit Representations of Meaning in Neural Language Models
  2. Belinda Li, Max Nye. Implicit Representations of Meaning in Neural Language Models [ACL 2021]
  3. Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension
  4. Janet and Penny went to the store to get presents for Jack. Janet said, “I will buy Jack a top.” “Don’t get Jack a top,” said Penny. “He has a top. He will get a top.” “I will get Jack a top,” said Janet. [Brown et al. 2020; example from Marcus & Davis 2020 / Charniak 1972] Language comprehension
  5. (Same passage as slide 4.) Language comprehension
  6. (Same passage as slide 4.) Language comprehension???
  7. Neural sequence models: John has a book. Mary has an apple. He gave her his
  8. Modeling the world described by language: Janet went to the store to get Jack a top.
  9. Modeling the world described by language. Janet went to the store to get Jack a top. [Diagram: Janet, store, Jack, get, top]
  10. Modeling the world described by language. Janet went to the store to get Jack a top. [Diagram: Janet, store, Jack, get, top, linked by the roles loc, agent, beneficiary, theme]
  11. Modeling the world described by language. Janet went to the store to get Jack a top. [Same diagram; which attributes should be tracked? purple? possesses?]
  12. Modeling the world described by language. Janet went to the store to get Jack a top. [Same diagram as slide 10]
  13. Modeling the world described by language. Janet went to the store to get Jack a top. But Jack already has a colorful top. [Diagram updated with the labels colorful and possesses on top]
  14. Modeling the world described by language. Janet went to the store to get Jack a top. She gave it to him. [Diagram: Janet, store, Jack, top; possesses]
  15. Dynamic Semantics. [Diagram: Janet, Penny, Jack, top; possesses] [Heim 1983, “File Change Semantics”; Kamp 1981, “Discourse Representation Theory”; Groenendijk & Stokhof 1991, “Dynamic Predicate Logic”]
  16. World models & language models. Janet and Penny went to the store to get Jack a top. But Jack already has a colorful top. [Diagram: world state (Janet, Penny, Jack, top; possesses) → p(“Jack will get a top” | …)]
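To make slide 16 concrete, here is a minimal sketch (my illustration, not the talk's setup) of how a pretrained causal language model assigns a probability to a candidate next sentence given the discourse. The GPT-2 checkpoint and the helper name are assumptions; the work in the talk uses encoder-decoder models fine-tuned on other data.

```python
# Sketch: scoring a candidate continuation under a pretrained causal LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log p(token | prefix) over the continuation tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Token t is predicted from position t-1; score only continuation tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_ctx = ctx_ids.shape[1]
    return token_lp[:, n_ctx - 1 :].sum().item()

context = ("Janet and Penny went to the store to get Jack a top. "
           "But Jack already has a colorful top.")
# Start the continuation with a space so the BPE boundary stays clean.
print(continuation_logprob(context, " Jack will get a top."))
```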
  17. Representations in language models. Janet and Penny went to the store to get Jack a top. But Jack already has a colorful top. [Diagram: vector representations → p(“Jack will get a top” | …)]
  18. Implicit representations of semantic state. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. [Diagram: LM encoder, LM decoder]
  19. (Same passage.) [Diagram adds the state: chest open, you possess the key, door locked]
  20. (Same passage.) [Diagram adds a semantic probe attached to the LM encoder]
  21. (Same passage.) [Diagram: the semantic probe decodes “door locked” = T from the encoder representation]
  22. (Same as slide 21.)
  23. (Same as slide 21.)
  24. (Same as slide 21.)
  25. Building the probe. You see an open chest. The only thing in the chest is an old key. There is a locked wooden door leading east. You pick up the key. You unlock the door. [Diagram: LM encoder, LM decoder; a second LM encoder acts as a proposition encoder for “the door is locked”]
  26. Building the probe. (Same setup, without the final sentence.) [Diagram adds a proposition localizer.] Decode facts about an entity from the encoding of its first mention.
  27. Building the probe. [Diagram: proposition encoder, LM encoder, proposition localizer, classifier W.] Train a linear model to predict the truth value of each proposition: “the door is locked” = T.
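Slides 25-27 describe the probe pipeline: encode the proposition text with the same LM encoder, localize the entity's first mention in the context encoding, and train a linear model to predict the proposition's truth value. The PyTorch sketch below is schematic; the bilinear form of the classifier and all helper names are my assumptions, and the exact probe architecture in Li et al. (ACL 2021) may differ in detail.

```python
# Schematic probe: linear/bilinear truth-value classifier over
# (entity first-mention encoding, proposition encoding) pairs.
import torch
import torch.nn as nn

class PropositionProbe(nn.Module):
    """Predict whether a proposition (e.g. "the door is locked") is true,
    given the LM encoding of the entity's first mention and an encoding
    of the proposition text."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # A single bilinear score, trained with a logistic loss.
        self.W = nn.Bilinear(hidden_size, hidden_size, 1)

    def forward(self, entity_vec: torch.Tensor, prop_vec: torch.Tensor):
        return self.W(entity_vec, prop_vec).squeeze(-1)  # logit for "true"

def first_mention_vec(encoder_states, mention_spans, entity):
    """Proposition localizer (sketch): pool the frozen LM encoder states
    over the span where the entity is first mentioned in the context."""
    start, end = mention_spans[entity][0]
    return encoder_states[start:end].mean(dim=0)

probe = PropositionProbe(hidden_size=768)
loss_fn = nn.BCEWithLogitsLoss()  # target 1.0 if the proposition holds
```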
  28. Evaluation domains: Alchemy and TextWorld. TextWorld example: You are navigating through a house. You've just entered a serious study. There is a gross looking mantle in the room. It has nothing on it. You see a closed rusty toolbox. Now why would someone leave that there? Looks like there is a locked door. Find the key to unlock the door. You should try going east.
  29. Evaluation: does it work? What fraction of entities are exactly reconstructed? [Bar chart, 0-100%: Alchemy and TextWorld results for BART, T5, and the “no change” and “no LM” baselines]
  30. (Same chart as slide 29.)
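The metric on slides 29-30 is exact reconstruction at the entity level. Under my reading (an entity counts as correct only if every proposition the probe decodes about it matches the gold state), the computation looks roughly like this; the function and variable names are mine.

```python
# Sketch of the entity-level exact-match metric.
def entity_exact_match(predicted_state: dict, gold_state: dict) -> float:
    """predicted_state / gold_state: entity -> set of true propositions."""
    entities = gold_state.keys()
    correct = sum(predicted_state.get(e, set()) == gold_state[e] for e in entities)
    return correct / len(entities) if entities else 0.0

gold = {"door": {"locked"}, "chest": {"open"}, "key": {"in chest"}}
pred = {"door": {"locked"}, "chest": {"open"}, "key": set()}
print(entity_exact_match(pred, gold))  # 2/3 of entities exactly reconstructed
```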
  31. Evaluation: what kind of training matters? [Bar chart, 0-100: Alchemy and TextWorld results for T5, T5 without fine-tuning, random init, and random init without fine-tuning]
  32. Evaluation: locality. [Bar chart, 0-100: T5 on TextWorld, probing all mentions, the first mention, or the last mention]
  33. Evaluation: locality. [Heatmap over the tokens of the template “the [pos] beaker has [amount] [color], the [pos]+1 [amount] [color] …”, shown for the probe and the localizer (T5). Example: “the third beaker has 4 blue, the fourth 2 red … Drain 2 from beaker 3.” Probing has-2-blue(beaker3): 58.5% / 64.8% accuracy.]
  34. Language models as world models. There’s a locked wooden door leading east […] you open the door. [Diagram: the LM encoder maps the text to locked(.), leads(., east), and then ¬locked(.)]
  35. Language models as file cards [Heim 1983!]

  36. Building states from scratch. [Figure: intervention setup. C1 encodes “The first beaker has 2 green, the second beaker has 2 red, the third beaker has 1 green. Drain 2 from first beaker.” C2 encodes the same declarations followed by “Drain 2 from second beaker.” Cmix takes the encoded tokens from C1 and replaces the encodings of the second beaker’s initial-state declaration with the corresponding encodings from C2.]
     Table 2, intervention results (% of generations consistent with each context):
       Encoding | Context 1 | Context 2
       C1       |   96.2    |   21.6
       Cmix     |   86.7    |   64.8
       C2       |   24.1    |   87.7
     Though imperfect, Cmix is consistent with Context 1 much more often than C2 is, and with Context 2 much more often than C1 is, indicating that its underlying information state (approximately) believes both beakers to be empty.
  37. Building states from scratch. [Figure 5: candidate next instructions g1 “Mix the first beaker.”, g2 “Mix the second beaker.”, g3 “Mix the third beaker.” are decoded against the Cmix information state; two are marked inconsistent and one consistent.]
  38. Building states from scratch. (Same figure.) Additional example: empty the third beaker
  39. Building states from scratch. (Same figure.) Additional examples: empty the third beaker; stir the red beaker
  40. Building states from scratch. [Bar chart, 0-80: % of generations consistent with the combined context when conditioning on C1, C2, or Cmix.]
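Slides 36-40 intervene directly on encoder states: encode C1 and C2, splice the second beaker's declaration encodings from C2 into C1, and decode from the mixed representation. The sketch below illustrates the splice with an off-the-shelf BART checkpoint. Assumptions: the paper fine-tunes its models on Alchemy, so a stock facebook/bart-base will not produce sensible instructions; span locations are approximated by re-tokenizing prefixes; and generate() is assumed to accept precomputed encoder_outputs, as recent transformers versions do.

```python
# Sketch: splice encoder states from two contexts, then decode from the mix.
import torch
from transformers import BartForConditionalGeneration, BartTokenizerFast
from transformers.modeling_outputs import BaseModelOutput

tok = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

decl = ("The first beaker has 2 green, the second beaker has 2 red, "
        "the third beaker has 1 green.")
c1 = decl + " Drain 2 from first beaker."
c2 = decl + " Drain 2 from second beaker."

ids1 = tok(c1, return_tensors="pt")
ids2 = tok(c2, return_tensors="pt")
with torch.no_grad():
    enc1 = model.get_encoder()(**ids1).last_hidden_state
    enc2 = model.get_encoder()(**ids2).last_hidden_state

# Locate the tokens declaring the second beaker's initial state (approximate).
prefix = "The first beaker has 2 green,"
span_start = len(tok(prefix, add_special_tokens=False).input_ids) + 1  # +1 for <s>
span_end = span_start + len(
    tok(" the second beaker has 2 red,", add_special_tokens=False).input_ids)

# Cmix: C1's encodings, with the second beaker's declaration replaced by the
# corresponding (contextually "drained") encodings from C2.
enc_mix = enc1.clone()
enc_mix[:, span_start:span_end] = enc2[:, span_start:span_end]

out = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=enc_mix),
    attention_mask=ids1.attention_mask,
    max_length=20,
)
print(tok.decode(out[0], skip_special_tokens=True))
```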
  41. What’s still missing. Attribution of model errors: p(probe is accurate) vs. p(generation is semantically acceptable).
  42. Does grounded training improve accuracy? (Same chest-and-key passage, LM encoder/decoder, and semantic probe as before.) [Plot: probe accuracy vs. number of training examples, 0 to 10000, comparing “predict text+state” with “predict text”.]
  43. Would ground-truth states improve accuracy? You unlock the door. [Diagram: LM encoder, LM decoder; state: chest open, you possess the key, door locked.] [Plot: accuracy vs. % of training examples with state labels, 0% to 100%, comparing “predict text+state” with and without using the state when predicting.]
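Slides 42-43 ask whether adding state supervision during training helps the probe. One plausible reading, sketched below with T5, is a multi-task objective in which the same encoder is trained both to predict the next sentence and, for examples that have state annotations, to predict a serialized state string. The state format, the equal loss weighting, and the checkpoint are my assumptions, not the paper's exact setup.

```python
# Sketch: "predict text + state" as two seq2seq losses sharing one encoder.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

context = ("You see an open chest. The only thing in the chest is an old key. "
           "There is a locked wooden door leading east. You pick up the key.")
next_text = "You unlock the door."
state = "open(chest); possesses(you, key); locked(door)"  # illustrative format

enc = tok(context, return_tensors="pt")
text_labels = tok(next_text, return_tensors="pt").input_ids
state_labels = tok(state, return_tensors="pt").input_ids

# Predict the next sentence; when a state label exists, also predict the state.
loss_text = model(**enc, labels=text_labels).loss
loss_state = model(**enc, labels=state_labels).loss
loss = loss_text + loss_state
loss.backward()
```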
  44. What’s still missing. Quantification: There are twenty-three reindeer; most of them have red noses. Implication and counterfactuals: If Pat goes to the party, so will Jan. If Pat had gone to the last one, Mo would have gone too. Pat will go to the party this time.
  45. Summary. Language models produce (rudimentary) representations of world states, and these states can be manipulated with predictable effects on model output. But they are far from 100% reliable, and there are lots of open questions about what these representations capture and how to improve them.
  46. RESEARCHERS SPONSORS Belinda Li Max Nye Thank you!

  47. Summary (repeated from slide 45).