Commonsense Knowledge and Reasoning in Natural Language

wing.nus
July 16, 2021

Natural language understanding models are trained on a sample of the real-world situations they may encounter. Commonsense and world knowledge, language, and reasoning skills can help them address unknown situations sensibly. In this talk I will discuss two lines of work, addressing knowledge and reasoning respectively. I will first present a method for discovering relevant knowledge which is unstated but may be required for solving a particular problem, through a process of asking information-seeking questions. I will then discuss nonmonotonic reasoning in natural language, a core human reasoning ability that has been studied in classical AI but mostly overlooked in modern NLP. I will talk about several recent papers addressing abductive reasoning (reasoning about plausible explanations), counterfactual reasoning (what if?) and defeasible reasoning (updating beliefs given additional information). Finally, I will discuss open problems and future directions in building NLP models with commonsense reasoning abilities.

Transcript

  1. Commonsense Knowledge and Reasoning in Natural Language

    WING NUS NLP Seminar, July 2021. Vered Shwartz
  2-7. The LM Pre-training Revolution

    [Chart: score (0 to 100) by month, Jul '19 through Jan '21, with a Human Performance line and markers at Jul '19, Aug '19, Jan '20 and Jan '21. Values shown: 89.3, Baseline 71.5, 90.2, 84.6, 90.3.]
    Is language understanding nearly solved?
    Translation, Reading Comprehension, Chatbots
  8-14. Is Natural Language Understanding Nearly Solved?

    Language Model: The chocolate cake is amazing (+ 94.6%, - 5.4%)
    Pre-training: ✅ Syntax ✅ Word meanings ✅ Factual Knowledge ✅ …
    Fine-tuning: ✅ Understanding the task ✅ Learning to solve the task ❓ Generalization to unknown situations
    What are the remaining challenges?
  15-22. Overfitting to Data-specific Spurious Correlations

    🤖: A horse standing in the grass. (Szegedy et al., 2015)
    How many zebras? 🤖: 2 (Agrawal et al., 2016)
    How many dogs? 2. How many zebras? 2. How many giraffes? 2.
    p: I only had a soup but it was very filling. h: I didn't eat a salad. 🤖: contradiction (91.7%) (Gururangan, Swayamdipta, et al., 2018; Poliak et al., 2018)
    p: The boy ran in the park. h: The boy didn’t run in the park. contradiction
    …Solving datasets but not underlying tasks!
  23-27. What is Commonsense?

    The basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people.
    It’s a bad idea to touch a hot stove.
    It’s impolite to comment on people’s weight.
    Eating dinner comes before going to bed.
    …
    Introductory Tutorial on Commonsense Reasoning. Maarten Sap, Vered Shwartz, Antoine Bosselut, Dan Roth, and Yejin Choi. ACL 2020.
  28-31. Why Do NLP Models Need Commonsense?

    Translation: __ __ = yogurt with grass
    Reading Comprehension: Stevie Wonder announces he'll be having kidney surgery during London concert
    Chatbots: Medical chatbot using OpenAI’s GPT-3 told a fake patient to kill themselves
  32-36. Outline

    • Introspective knowledge acquisition through asking questions
    • Nonmonotonic reasoning in natural language
    • Open problems and future directions
  37-41. Reasoning with Implicit Knowledge

    Children need to eat more vegetables because they are healthy
    children / vegetables
    Vegetables are healthy
    Eating vegetables can make you healthier
    People want to be healthy
    Unsupervised Commonsense Question Answering with Self-Talk. Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula and Yejin Choi. EMNLP 2020.
  42-46. Discovery Learning (Bruner, 1961)

    Children need to eat more vegetables because they are healthy
    children / vegetables
    Learner
    Self-Inquiry: What are the properties of vegetables?
    Existing Knowledge: Vegetables are full of vitamins.
    New Facts
  47. The Self-Talk Paradigm

    Children need to eat more vegetables because they are healthy
    children / vegetables
    Neural Language Model
    Self-Inquiry: What are the properties of vegetables?
    Existing Knowledge: Vegetables are full of vitamins.
    Nested QA
    Main Question, Main Answer
  48-49. Input. Context: Children need to eat more vegetables because they are healthy. Answer choices: children, vegetables

    Knowledge Discovery, Question Answering
    Output. Predicted answer choice: vegetables
  50-59. Knowledge Discovery

    Instance: Children need to eat more vegetables because they are healthy.
    Nested Question Prefix: What is the purpose of
    (generated) vegetables?
    Nested Answer Prefix: The purpose of ___ is, filled in as: The purpose of vegetables is
    (generated) to provide a good base of nutrients and energy
    Nested Answers:
    The purpose of vegetables is to provide a good base of nutrients and energy.
    The properties of being healthy are linked to the effects of exercise.
    The definition of healthy is quality of life that is free of diseases.
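
The knowledge discovery steps on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PREFIXES`, `discover_knowledge` and `toy_generate` are hypothetical names, the second prefix pair is inferred from the nested answers shown on the slide, and `toy_generate` returns canned completions in place of a real neural language model.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from nested-question prefixes to answer-prefix templates.
# "{}" marks where the generated question completion is inserted.
PREFIXES: Dict[str, str] = {
    "What is the purpose of": "The purpose of {} is",
    "What is the definition of": "The definition of {} is",
}


def discover_knowledge(instance: str, generate: Callable[[str], str]) -> List[str]:
    """Return one nested answer (clarification) per question prefix."""
    clarifications = []
    for question_prefix, answer_template in PREFIXES.items():
        # Step 1: the LM completes the nested question, e.g. "vegetables?"
        completion = generate(f"{instance} {question_prefix}").strip()
        topic = completion.rstrip("?").strip()
        # Step 2: the LM completes the matching answer prefix.
        answer_prefix = answer_template.format(topic)
        prompt = f"{instance} {question_prefix} {topic}? {answer_prefix}"
        answer = generate(prompt).strip()
        clarifications.append(f"{answer_prefix} {answer}")
    return clarifications


def toy_generate(prompt: str) -> str:
    """Stand-in for a language model: canned completions keyed on the prompt ending."""
    canned = {
        "What is the purpose of": "vegetables?",
        "The purpose of vegetables is": "to provide a good base of nutrients and energy.",
        "What is the definition of": "healthy?",
        "The definition of healthy is": "quality of life that is free of diseases.",
    }
    for ending, completion in canned.items():
        if prompt.endswith(ending):
            return completion
    return ""


instance = "Children need to eat more vegetables because they are healthy."
nested_answers = discover_knowledge(instance, toy_generate)
```

With a real model, `generate` would sample continuations, so each prefix could yield several nested answers rather than one.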
  60. Input. Context: Children need to eat more vegetables because they are healthy. Answer choices: children, vegetables

    Knowledge Discovery (LM). Question prefixes: What are the properties of a, What is the purpose of, What is the main function of, …
    What is the definition of healthy?
    The purpose of vegetables is to provide a good base of nutrients and energy.
    The properties of being healthy are linked to the effects of exercise.
    The definition of healthy is quality of life that is free of diseases.
    Question Answering
    Output. Predicted answer choice: vegetables
  61-70. Question Answering

    Children: Children need to eat more vegetables because children are healthy. The purpose of vegetables is to provide a good base of nutrients and energy.
    Vegetables: Children need to eat more vegetables because vegetables are healthy. The purpose of vegetables is to provide a good base of nutrients and energy.
    …
    Children: Children need to eat more vegetables because children are healthy. The definition of healthy is quality of life that is free of diseases.
    Vegetables: Children need to eat more vegetables because vegetables are healthy. The definition of healthy is quality of life that is free of diseases.
    Most plausible statement: statement with best language model score
    Language Model: Children need … and energy
    $\text{score} = -\frac{1}{n} \log\big( p(\text{need} \mid \text{Children}) \cdot p(\text{to} \mid \text{Children need}) \cdots p(\text{energy} \mid \ldots) \cdot p(\langle\text{eos}\rangle \mid \ldots) \big)$
    We’ll get back to this soon…
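
As a rough illustration of the plausibility score above, the sketch below computes the average negative log-probability of a statement from its per-token probabilities. The function names and the probability values are made up for the example, and it assumes that the "best" score is the lowest one, since the score is a negated log-probability.

```python
import math
from typing import Dict, Sequence


def lm_score(token_probs: Sequence[float]) -> float:
    """Average negative log-probability: -(1/n) * log(p_1 * ... * p_n)."""
    n = len(token_probs)
    # Summing logs equals the log of the product and avoids numeric underflow.
    return -sum(math.log(p) for p in token_probs) / n


def most_plausible(statements: Dict[str, Sequence[float]]) -> str:
    """Return the statement with the lowest score (assumed to mean 'best')."""
    return min(statements, key=lambda s: lm_score(statements[s]))


# Made-up token probabilities, for illustration only.
toy_statements = {
    "... because children are healthy.": [0.20, 0.10, 0.05, 0.30],
    "... because vegetables are healthy.": [0.20, 0.10, 0.40, 0.30],
}
best_statement = most_plausible(toy_statements)
```

In practice the token probabilities would come from a pre-trained language model rather than a hand-written table.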
  71-72. Input. Context: Children need to eat more vegetables because they are healthy. Answer choices: children, vegetables

    Knowledge Discovery (LM). Question prefixes: What are the properties of a, What is the purpose of, What is the main function of, …
    What is the definition of healthy?
    The purpose of vegetables is to provide a good base of nutrients and energy.
    The properties of being healthy are linked to the effects of exercise.
    The definition of healthy is quality of life that is free of diseases.
    Question Answering (LM): Answer with most plausible statement
    Output. Predicted answer choice: vegetables
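
The question answering step can be sketched as follows: every answer choice is paired with every nested answer, each combined statement is scored, and the choice behind the most plausible statement is returned. The names `answer_with_self_talk` and `toy_score` are hypothetical, and `toy_score` is a made-up placeholder standing in for a language model score where lower means more plausible.

```python
from itertools import product
from typing import Callable, Sequence


def answer_with_self_talk(
    template: str,
    choices: Sequence[str],
    clarifications: Sequence[str],
    score: Callable[[str], float],
) -> str:
    """Pick the answer choice whose best (lowest-scoring) statement wins."""
    candidates = []
    for choice, clarification in product(choices, clarifications):
        # Substitute the choice into the context, then append the nested answer.
        statement = f"{template.format(choice)} {clarification}"
        candidates.append((score(statement), choice))
    return min(candidates)[1]


def toy_score(statement: str) -> float:
    """Placeholder scorer with made-up values; a real system would query an LM."""
    return 1.0 if "vegetables are healthy" in statement else 2.0


predicted = answer_with_self_talk(
    "Children need to eat more vegetables because {} are healthy.",
    ["children", "vegetables"],
    [
        "The purpose of vegetables is to provide a good base of nutrients and energy.",
        "The definition of healthy is quality of life that is free of diseases.",
    ],
    toy_score,
)
```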
  73-75. Commonsense Question Answering Tasks

    Temporal: MC-TACO
    Social: SocialIQa
    Physical: PIQA
    General: CommonsenseQA, WinoGrande
    Causal: COPA
    Social Interaction QA: Although Aubrey was older and stronger, they lost to Alex in arm wrestling. How would Alex feel as a result? 1) they need to practice more. 2) ashamed. 3) boastful.
    Choice of Plausible Alternatives: The man broke his toe. What was the cause? 1) He got a hole in his sock. 2) He dropped a hammer on his foot.
  76-82. Baselines

    Children need to eat more vegetables because they are healthy.
    No Inquiry: ∅
    Expert Knowledge:
    vegetables, healthy
    vegetables [required for] eating vegetables [motivated by] healthy
    Vegetables are required for eating vegetables. Eating vegetables is motivated by being healthy.
    Vegetables are healthy.
    cause: Because the children wanted to live longer.
  83-87. Results

    [Bar chart, axis 0 to 100: No Inquiry, Expert Knowledge, Self-Talk, Human]
    1. Nested QA improves performance
    2. Self-Talk performs similarly to models with expert knowledge
    3. Gap from human performance
    What should I ask about?
  88. Measuring Plausibility

    Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right. Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. arXiv 2021.
  89. Measuring Plausibility: Zero-shot Models

    [Diagram: a language model reads "A human … whirlpool bath" and assigns each token a probability: p(human|A), p(wants|A human), …, p(bath|…), p(<eos>|…).]
    $\text{score} = -\frac{1}{n} \log\big( p(\text{human} \mid \text{A}) \, p(\text{wants} \mid \text{A human}) \cdots p(\text{bath} \mid \ldots) \, p(\texttt{<eos>} \mid \ldots) \big)$
    The product inside the log is the standard language model probability; the factor $\frac{1}{n}$ normalizes by length.
    Predict: $\arg\max_i P(\text{ans}_i \mid \text{question})$
    Confounders: (1) String length (2) Word frequency
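
The length confounder can be illustrated with a minimal sketch. The per-token probabilities below are made-up numbers, not model outputs, and the function names are hypothetical; the point is only that the raw product and the length-normalized score can rank two answers differently.

```python
import math

def sequence_log_prob(token_probs):
    """Log of the standard LM probability: sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

def length_normalized_score(token_probs):
    """score = -(1/n) * log(product of token probabilities); lower means more plausible."""
    return -sequence_log_prob(token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two candidate answers (assumed values).
short_answer = [0.2, 0.2]                  # 2 tokens, each fairly unlikely
long_answer = [0.5, 0.5, 0.5, 0.5, 0.5]    # 5 tokens, each more likely

raw_short = math.exp(sequence_log_prob(short_answer))
raw_long = math.exp(sequence_log_prob(long_answer))

# The raw probability favors the short string only because it has fewer factors.
prefers_short_raw = raw_short > raw_long
# After normalizing by length, the long answer gets the better (lower) score.
prefers_long_normalized = (
    length_normalized_score(long_answer) < length_normalized_score(short_answer)
)
```

Normalizing by length addresses only the first confounder; word frequency still affects the per-token probabilities themselves.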
  92. Measuring Plausibility: Surface Form Competition

    Different strings represent the same concept but compete for probability!
    * Not considering multiple possible correct answers
  94. Measuring Plausibility: Surface Form Competition

    Probability: $\arg\max_i P(\text{ans}_i \mid \text{question})$
    Domain Conditional PMI: $\arg\max_i \frac{P(\text{ans}_i \mid \text{question})}{P(\text{ans}_i \mid \text{domain})}$
    The denominator is the prior probability of each answer given a domain premise, e.g. "The answer is".
    Domain Conditional PMI consistently outperforms other zero-shot scoring methods across multiple-choice tasks (QA, entailment, text classification), for GPT-2 & GPT-3!
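
As a rough illustration of the two scoring rules, here is a small sketch with invented probabilities (the answer strings and all numbers are assumptions for the example, not results from the paper). It shows how dividing by the answer's prior under a domain premise can change which candidate wins.

```python
def argmax_probability(p_given_question):
    """Pick the answer with the highest P(ans | question)."""
    return max(p_given_question, key=p_given_question.get)

def argmax_domain_conditional_pmi(p_given_question, p_given_domain):
    """Pick the answer with the highest P(ans | question) / P(ans | domain)."""
    return max(
        p_given_question,
        key=lambda ans: p_given_question[ans] / p_given_domain[ans],
    )

# Hypothetical probabilities (assumed values for illustration only).
p_question = {"bath": 0.04, "whirlpool bath": 0.03, "sink": 0.05}
# Prior of each answer after a domain premise such as "The answer is".
p_domain = {"bath": 0.02, "whirlpool bath": 0.005, "sink": 0.05}

by_probability = argmax_probability(p_question)
by_pmi = argmax_domain_conditional_pmi(p_question, p_domain)
```

With these numbers, the frequent string "sink" wins on raw probability, while "whirlpool bath" wins once each answer is discounted by how likely it is regardless of the question.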
  100. Outline

    • Introspective knowledge acquisition through asking questions
    • Nonmonotonic reasoning in natural language
    • Open problems and future directions
  101. Nonmonotonic Reasoning

    • Abductive reasoning (most plausible explanation): ART (Bhagavatula et al., 2020)
    • Counterfactual reasoning (what if?): TimeTravel (Qin et al., 2019)
    • Defeasible reasoning (updating inferences with additional information): δ-NLI (Rudinger et al., 2020)
  105. Abductive Reasoning (Peirce, 1965)

    Reason about the most plausible explanation for incomplete observations.
    Example from ART (Bhagavatula et al., 2020):
    Observations: "Sara wanted to make dinner for some guests." "She had to order pizza for her friends instead."
    Most plausible explanation: "But she didn’t know how to cook."
    Useful for filling in gaps in story understanding.
  111. Challenge: Language models are conditioned only on a past context

    "Sara wanted to make dinner for some guests." → GPT-2 → "I'm going to go grab some rice noodles," she says.
    Solution: compute loss w.r.t. future constraints & backpropagate to the output.
    Inspiration: Image Style Transfer (Gatys et al., 2016). [Diagram: inputs (source image, style) → ConvNet → output; the loss is backpropagated to the output.]
  115. DELOREAN: DEcoding for nonmonotonic LOgical REAsoNing

    Input:
    X - past context: "Sara wanted to make dinner for some guests."
    Z - future constraints: "She had to order pizza for her friends instead."
    Output:
    Y - continuation
    • Fluent continuation of X
    • Satisfies the constraints Z
    Steps: Initialization, Backward Pass, Forward Pass
    Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning. Lianhui (Karen) Qin, Vered Shwartz, Peter West, Chandra Bhagavatula, Jena Hwang, Ronan Le Bras, Antoine Bosselut, and Yejin Choi. EMNLP 2020.
  118. Initialization

    [Diagram: the LM reads $X = (x_1, x_2, \ldots, x_{N_X})$, "Sara wanted to make dinner for some guests.", and produces $\tilde{Y} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_N)$; Z is "She had to order pizza for her friends instead."]
    Initialization: $\tilde{Y}$ = decode N tokens from $\mathrm{LM}_{\text{forward}}(X)$
  123. Backward Pass

    [Diagram: the LM reads X, $\tilde{Y}$, and $Z = (z_1, z_2, \ldots, z_{N_Z})$, "She had to … </s>"; a task-specific loss function is computed, and backpropagation yields the backward logits $\tilde{y}^b_1, \tilde{y}^b_2, \ldots, \tilde{y}^b_N$.]
    Task-specific loss function: maximize the likelihood that the LM generates the future observation Z following the past observation X and the generated hypothesis $\tilde{Y}$:
    $\mathcal{L}(X, \tilde{Y}, Z) := -\sum_{n=1}^{N_Z} \log P_{\mathrm{LM}}(z_n \mid X, \tilde{Y}, Z_{1:n-1})$
  131. Forward Pass

    [Diagram: the LM reads X and produces forward logits $\tilde{y}^f_1, \tilde{y}^f_2, \ldots, \tilde{y}^f_N$, which are combined with the backward logits $\tilde{y}^b_1, \tilde{y}^b_2, \ldots, \tilde{y}^b_N$ to update $\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_N$.]
    Forward Pass: $\tilde{Y}$ = decode N tokens from $\mathrm{LM}_{\text{forward}}(X)$, and mix with backward logits.
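
The slide says only that forward logits are mixed with backward logits. One simple way to picture this, offered as an assumption rather than the exact rule from the talk, is a per-position weighted average followed by greedy decoding. The weight `gamma`, the three-word vocabulary, and all logit values are hypothetical.

```python
def mix_logits(forward_logits, backward_logits, gamma=0.5):
    """Per-position weighted average of forward and backward logits.

    gamma is a hypothetical mixing weight: 1.0 keeps only the forward
    logits, 0.0 keeps only the backward logits.
    """
    if len(forward_logits) != len(backward_logits):
        raise ValueError("forward and backward logits must cover the same positions")
    return [
        [gamma * f + (1.0 - gamma) * b for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(forward_logits, backward_logits)
    ]

def greedy_decode(logits, vocab):
    """Pick the highest-scoring vocabulary item at every position."""
    return [vocab[max(range(len(row)), key=row.__getitem__)] for row in logits]

# Toy one-position example with an assumed three-word vocabulary.
vocab = ["and", "but", "so"]
forward = [[2.0, 1.0, 0.0]]    # fluency with X alone prefers "and"
backward = [[0.0, 3.0, 0.0]]   # the future constraint Z prefers "but"

forward_only = greedy_decode(mix_logits(forward, backward, gamma=1.0), vocab)
mixed = greedy_decode(mix_logits(forward, backward, gamma=0.5), vocab)
```

In this toy case the forward logits alone decode "and", while the mixture decodes "but", a contrast marker of the kind the future constraint calls for.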
  137. Generation

    [Diagram: after the backward and forward passes, greedy decoding from $\tilde{Y}$ yields Y.]
    X: "Sara wanted to make dinner for some guests."
    Y: "But she didn’t know how to cook."
    Z: "She had to order pizza for her friends instead."
    Generation: greedy decoding from $\tilde{Y}$
  142. DELOREAN

    Input:
    X - past context: "Sara wanted to make dinner for some guests."
    Z - future constraints: "She had to order pizza for her friends instead."
    Steps: Initialization, Backward Pass, Forward Pass, Greedy Decoding; repeat T times; Select Best Y
    Output: Y - continuation
  145. Select Best Y

    Select the $Y^{(t)}$ that is most likely to follow and precede its adjacent sentences:
    $\text{score}(Y^{(t)}) = \text{BERT}_{\text{NSP}}(XY^{(t)}, Z) + \text{BERT}_{\text{NSP}}(X, Y^{(t)}Z)$
    First term: P(She had to order pizza for her friends instead. | Sara wanted to make dinner for some guests. But she didn’t know how to cook.)
    Second term: P(But she didn’t know how to cook. She had to order pizza for her friends instead. | Sara wanted to make dinner for some guests.)
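
The ranking step can be sketched as follows. The function `toy_nsp` is a deliberately crude stand-in for a BERT next-sentence-prediction probability (it is not a real model and its keyword rule is purely an assumption for the demo); in practice `nsp` would be any callable returning how likely the second text is to follow the first.

```python
from typing import Callable, Sequence

def select_best(
    x: str,
    z: str,
    candidates: Sequence[str],
    nsp: Callable[[str, str], float],
) -> str:
    """Return the candidate Y maximizing nsp(X + Y, Z) + nsp(X, Y + Z)."""
    def score(y: str) -> float:
        return nsp(f"{x} {y}", z) + nsp(x, f"{y} {z}")
    return max(candidates, key=score)

def toy_nsp(first: str, second: str) -> float:
    """Hypothetical stand-in scorer, NOT BERT: rewards texts that mention cooking."""
    return 0.9 if "cook" in first + second else 0.2

x = "Sara wanted to make dinner for some guests."
z = "She had to order pizza for her friends instead."
candidates = [
    "She was thinking about the best way.",
    "But she didn't know how to cook.",
]

best = select_best(x, z, candidates, toy_nsp)
```

Summing the two terms asks for a Y that both follows X well and leads into Z well, rather than one that fits only one side.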
  149. Human Evaluation Results: Abductive Reasoning

    [Bar chart: coherence on a 0 to 10 scale for X-Y, Y-Z, and X-Y-Z; series: DELOREAN, Unsupervised, Supervised, Human. Bar labels: 7.83, 8.23, 8.79; 3.14, 3.75, 5.1; 2.36, 2.38, 4.74; 2.97, 3.25, 5.22.]
    1. Outperforms unsupervised models substantially
    2. Competitive with supervised models!
    3. Large gap from human performance
  155. Example Generations

    X: "Sara wanted to make dinner for some guests."
    Z: "She had to order pizza for her friends instead."
    1. She was thinking about the best way.
    2. However, her cooking skills were the only thing that could make it a success.
    3. But she couldn’t, because she was too busy with her work.
    4. But she didn’t have the money and she didn’t have her own kitchen.
    5. But she didn’t know how to cook.
    Backward pass introduces: contrast!
  161. Counterfactual Reasoning (Goodman, 1947)

    Reason about changes in outcomes given a change in conditions.
    Useful for Argument Mining: If X had happened, it would have resulted in some unwanted outcome Y.
    Useful for Detecting Misinformation: Claim X is false because it entails claim Y, which is known to be false.
  164. TimeTravel (Qin et al., 2019)

    Original Story: Lisa was throwing a Halloween party. All her friends were dressing up. Lisa thought about being a wizard. Then she decided on a scarier costume. Lisa dressed up like a vampire.
    Counterfactual Beginning: Lisa was throwing a Halloween party. All her friends were dressing up. It was a Game of Thrones themed party.
    Alternative Ending:
    • Adheres to the counterfactual beginning
    • Minimally edits the original ending
    Edited ending (replaced original text in brackets): Lisa thought about [being a wizard] how she would dress up as a Lannister, but she didn’t want to look like a Lannister. [Then she decided on a scarier costume.] She wanted to look like a Stark. Lisa dressed up like a [vampire] Stark.
  168. Counterfactual Reasoning

    Input:
    X - counterfactual beginning: "Lisa was throwing a Halloween party. All her friends were dressing up. It was a Game of Thrones themed party."
    Z - original ending: "Lisa thought about being a wizard. Then she decided on a scarier costume. Lisa dressed up like a vampire."
    Output:
    Y - alternative ending
    • Adheres to the counterfactual story beginning
    • Minimally edits the original ending
    Steps: Initialization; Backward Pass; Forward Pass; Generation + Select best Y
    Backward Pass: minimize the KL divergence between the original ending Z (one-hot representation) and the generated ending $\tilde{Y}$:
    $\mathcal{L}(X, \tilde{Y}, Z) := \mathrm{KL}(Z \,\|\, \mathrm{softmax}(\tilde{Y}/\tau))$
    DeLorean was the only method to achieve a good balance between the two requirements.
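
For a single token position, the counterfactual loss and one backward step can be worked through by hand. Because Z is one-hot, KL(Z ∥ softmax(Ỹ/τ)) reduces to the negative log-probability of the original token, and its gradient with respect to the logits is (softmax(Ỹ/τ) − Z)/τ. The sketch below assumes a three-word vocabulary, made-up logits, and a hypothetical `step_size`; it is an illustration of the formula, not the authors' implementation.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((v - m) / tau) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_one_hot(logits, target_index, tau=1.0):
    """KL(Z || softmax(logits / tau)) for one-hot Z; equals -log q[target]."""
    return -math.log(softmax(logits, tau)[target_index])

def backward_step(logits, target_index, tau=1.0, step_size=1.0):
    """One gradient-descent step on the logits; the gradient is (q - z) / tau."""
    q = softmax(logits, tau)
    return [
        v - step_size * (q_i - (1.0 if i == target_index else 0.0)) / tau
        for i, (v, q_i) in enumerate(zip(logits, q))
    ]

# Assumed toy vocabulary and logits for the last word of the ending.
vocab = ["vampire", "Stark", "wizard"]
logits = [0.0, 1.0, 0.0]               # the LM, given X, currently prefers "Stark"
target = vocab.index("vampire")        # token from the original ending Z

loss_before = kl_to_one_hot(logits, target)
updated = backward_step(logits, target)
loss_after = kl_to_one_hot(updated, target)
```

The step pulls the logits toward the original ending, which is what encourages minimal edits; the forward pass then pulls them back toward fluency with the counterfactual beginning.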
  173. Defeasible Inference (Reiter, 1980)

    Given premise P, a hypothesis H is defeasible if there exists an update U (consistent with P) such that a human would find H less likely to be true after learning U.
    P: Tweety is a bird.
    H: Tweety flies.
    U: Tweety is a penguin.
    Useful for Real-time Summarization: Facts change as the story unfolds.
  179. Defeasible Inference in Natural Language

    An update U is called a weakener if, given a premise P and hypothesis H, a human would most likely find H less likely to be true after learning U; if they would find H more likely to be true, then we call U a strengthener.
    P: Tweety is a bird.
    H: Tweety flies.
    Weakener: Tweety is a penguin.
    Strengthener: Tweety is on a tree.
    Thinking Like a Skeptic: Defeasible Inference in Natural Language. Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, and Yejin Choi. Findings of EMNLP 2020.
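
The weakener/strengthener definition amounts to comparing how plausible H is before and after learning U. A minimal sketch, assuming plausibility is given as a number between 0 and 1 (the values below are invented, and the "neither" label for the unchanged case is an addition for completeness, not part of the definition):

```python
def classify_update(plausibility_before: float, plausibility_after: float) -> str:
    """Label an update by how it changes the plausibility of the hypothesis H."""
    if plausibility_after < plausibility_before:
        return "weakener"
    if plausibility_after > plausibility_before:
        return "strengthener"
    return "neither"  # hypothetical label for an update that changes nothing

# Assumed plausibility judgments for H: "Tweety flies." given P: "Tweety is a bird."
h_given_p = 0.90
h_given_p_and_penguin = 0.05   # U: "Tweety is a penguin."
h_given_p_and_tree = 0.95      # U: "Tweety is on a tree."

penguin_label = classify_update(h_given_p, h_given_p_and_penguin)
tree_label = classify_update(h_given_p, h_given_p_and_tree)
```

The same update can be a weakener for one hypothesis and a strengthener for another, which is why the label depends on the full (P, H, U) triple.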
  181. Defeasible Inference in Natural Language

    Premise: A group of people sitting around a rectangular table having either pieces of paper or laptops in front of them.
    Hypothesis: They have a work meeting.
    + Strengthener: They are in a conference room.
    - Weakener: They are in a library.
    Discriminative Task: Determine whether an update weakens or strengthens the hypothesis.
    Generative Task: Generate a weakening or strengthening update for a given premise-hypothesis pair.
    Language models leave plenty of room for improvement on the generative task!
  185. Rationale Generation for Defeasible Inference

    Premise: A group of people sitting around a rectangular table having either pieces of paper or laptops in front of them.
    Hypothesis: They have a work meeting.
    + They are in a conference room. Rationale: A conference room is where people have meetings at work.
    - They are in a library. Rationale: You must be quiet in the library, while work meetings involve talking.
    Distant supervision: e-SNLI; LM ("The definition of a library is…")
    Learning to Rationalize for Nonmonotonic Reasoning with Distant Supervision. Faeze Brahman, Vered Shwartz, Rachel Rudinger, and Yejin Choi. AAAI 2021.
  189. Rationale Generation for Defeasible Inference

    Example: Premise: A group of people sitting around a rectangular table having either pieces of paper or laptops in front of them. Hypothesis: They have a work meeting.
    + They are in a conference room. Rationale: A conference room is where people have meetings at work.
    - They are in a library. Rationale: You must be quiet in the library, while work meetings involve talking.
    Post hoc Rationalization: Generates a rationale for a given decision (label). Trivially rephrasing the label! (“[+] implies that [H]”)
    Joint Prediction & Rationalization: Predict the label (strengthener / weakener) and rationalize it. More realistic but very challenging task!
  194. Reliable Evaluation 64

  195. Reliable Evaluation Discriminative tasks: A B C 64

  196. Reliable Evaluation Discriminative tasks: A B C Easy to evaluate

    64
  197. Reliable Evaluation Discriminative tasks: A B C Easy to evaluate

    Models are right for the wrong reasons 64
  199. Reliable Evaluation … Generative tasks: 
 Discriminative tasks: A B

    C Easy to evaluate Models are right for the wrong reasons 65
  200. Reliable Evaluation … Generative tasks: 
 More nuanced & flexible

    than pre-defined labels Discriminative tasks: A B C Easy to evaluate Models are right for the wrong reasons 65
  201. Reliable Evaluation … Generative tasks: 
 More nuanced & flexible

    than pre-defined labels More similar to human reasoning process 
 (no “answer choices”) Discriminative tasks: A B C Easy to evaluate Models are right for the wrong reasons 65
  202. Reliable Evaluation … Generative tasks: 
 More nuanced & flexible

    than pre-defined labels More similar to human reasoning process 
 (no “answer choices”) Infinite answer space 
 (no “guessing” of correct answer) Discriminative tasks: A B C Easy to evaluate Models are right for the wrong reasons 65
  203. Reliable Evaluation … Generative tasks: 
 More nuanced & flexible

    than pre-defined labels More similar to human reasoning process 
 (no “answer choices”) Infinite answer space 
 (no “guessing” of correct answer) No reliable automatic evaluation metric Discriminative tasks: A B C Easy to evaluate Models are right for the wrong reasons 65
  204. Sara wanted to make dinner for some guests. She had

    to order pizza for her friends instead. Generative Evaluation Reliable Evaluation But she didn’t know how to cook. 66
  205. Desiderata: Sara wanted to make dinner for some guests. She

    had to order pizza for her friends instead. Generative Evaluation Reliable Evaluation But she didn’t know how to cook. 66
  206. Desiderata: $ Reward correct answers that are different from the

    reference. Sara wanted to make dinner for some guests. She had to order pizza for her friends instead. Generative Evaluation Right before the guests arrived she tasted the food and it tasted bad. Reliable Evaluation But she didn’t know how to cook. 66
  207. Desiderata: $ Reward correct answers that are different from the

    reference. % Penalize incorrect answers that are similar to the reference. Sara wanted to make dinner for some guests. She had to order pizza for her friends instead. Generative Evaluation Right before the guests arrived she tasted the food and it tasted bad. She didn’t know how to cook meat. Reliable Evaluation But she didn’t know how to cook. 66
  208. Desiderata: $ Reward correct answers that are different from the

    reference. % Penalize incorrect answers that are similar to the reference. Generative Evaluation Reliable Evaluation 67
  209. Desiderata: $ Reward correct answers that are different from the

    reference. % Penalize incorrect answers that are similar to the reference. Generative Evaluation Lexical Overlap Metrics: BLEU, ROUGE, METEOR, CIDEr $ % lexical variation Weak correlation with human judgement (Novikova et al., 2017). Reliable Evaluation 67
  210. Desiderata: $ Reward correct answers that are different from the

    reference. % Penalize incorrect answers that are similar to the reference. Generative Evaluation Lexical Overlap Metrics: BLEU, ROUGE, METEOR, CIDEr $ % lexical variation Weak correlation with human judgement (Novikova et al., 2017). Semantic Similarity Based Metrics: SPICE, BERTScore, BLEURT $ % Relatedness / similarity is very fuzzy! Reliable Evaluation 67
  211. Desiderata: $ Reward correct answers that are different from the

    reference. % Penalize incorrect answers that are similar to the reference. Generative Evaluation Lexical Overlap Metrics: BLEU, ROUGE, METEOR, CIDEr $ % lexical variation Weak correlation with human judgement (Novikova et al., 2017). Semantic Similarity Based Metrics: SPICE, BERTScore, BLEURT $ % Relatedness / similarity is very fuzzy! Reliable Evaluation 🔮 Combine metrics 67
  212. Desiderata: $ Reward correct answers that are different from the

    reference. % Penalize incorrect answers that are similar to the reference. Generative Evaluation Lexical Overlap Metrics: BLEU, ROUGE, METEOR, CIDEr $ % lexical variation Weak correlation with human judgement (Novikova et al., 2017). Semantic Similarity Based Metrics: SPICE, BERTScore, BLEURT $ % Relatedness / similarity is very fuzzy! Reliable Evaluation 🔮 Combine metrics 🔮 Extrinsic evaluation 67
  213. Desiderata: $ Reward correct answers that are different from the

    reference. % Penalize incorrect answers that are similar to the reference. Generative Evaluation Lexical Overlap Metrics: BLEU, ROUGE, METEOR, CIDEr $ % lexical variation Weak correlation with human judgement (Novikova et al., 2017). Semantic Similarity Based Metrics: SPICE, BERTScore, BLEURT $ % Relatedness / similarity is very fuzzy! Reliable Evaluation 🔮 Combine metrics 🔮 Extrinsic evaluation 🔮 Task-specific learned metric (Chen et al., 2020) 67
  214. Desiderata: $ Reward correct answers that are different from the

    reference. % Penalize incorrect answers that are similar to the reference. Generative Evaluation Lexical Overlap Metrics: BLEU, ROUGE, METEOR, CIDEr $ % lexical variation Weak correlation with human judgement (Novikova et al., 2017). Semantic Similarity Based Metrics: SPICE, BERTScore, BLEURT $ % Relatedness / similarity is very fuzzy! Reliable Evaluation 🔮 Combine metrics 🔮 Extrinsic evaluation 🔮 Task-specific learned metric (Chen et al., 2020) 🔮 Train discriminator to evaluate the generator (e.g. Martínez-Plumed et al., 2019, Forbes et al., 2020) Social Chemistry 101: Learning to Reason about Social and Moral Norms. Maxwell Forbes, Jena D. Hwang, Vered Shwartz, et al. EMNLP 2020. 67
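The two desiderata can be checked against the Sara example from the slides. The sketch below is a minimal illustration, assuming a simplified unigram-overlap F1 as a stand-in for lexical overlap metrics; it is not BLEU, ROUGE, METEOR, or CIDEr themselves. The names `tokens` and `unigram_f1` are hypothetical helpers introduced here.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase and split into word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall (simplified overlap metric)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Sentences taken from the slides' example.
reference = "But she didn't know how to cook."
different_candidate = "Right before the guests arrived she tasted the food and it tasted bad."
similar_candidate = "She didn't know how to cook meat."

print(f"{unigram_f1(different_candidate, reference):.2f}")  # 0.10
print(f"{unigram_f1(similar_candidate, reference):.2f}")    # 0.86
```

The answer that differs from the reference shares only one token with it and scores 0.10, while the near-copy of the reference scores 0.86. An overlap score of this kind rewards surface similarity rather than correctness, which is the failure the desiderata describe.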
  215. Outline • Introspective knowledge acquisition through asking questions
 • Nonmonotonic

    reasoning in natural language
 
 
 • Open problems and future directions
  216. Not sensitive to negation (Kassner et al. 2020; Ettinger, 2020)

    69 Open Problems #1 - Limited Precision
  217. Not sensitive to negation (Kassner et al. 2020; Ettinger, 2020)

    69 DirectX is developed by [MASK]. Often predict similar but mutually-exclusive facts (Jiang et al., 2020) Open Problems #1 - Limited Precision
  218. Not sensitive to negation (Kassner et al. 2020; Ettinger, 2020)

    69 DirectX is developed by [MASK]. Often predict similar but mutually-exclusive facts (Jiang et al., 2020) Solution: paraphrase & aggregate Open Problems #1 - Limited Precision
  219. Not sensitive to negation (Kassner et al. 2020; Ettinger, 2020)

    Zebras are black and white. My shirt is blue/red. Don’t differentiate constant vs. contingent facts 69 DirectX is developed by [MASK]. Often predict similar but mutually-exclusive facts (Jiang et al., 2020) Solution: paraphrase & aggregate Open Problems #1 - Limited Precision
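One way to read "paraphrase & aggregate" is to query the model with several rewordings of the same prompt and combine the per-prompt predictions. The sketch below assumes simple probability averaging as the aggregation step, which is only one possible choice and not necessarily the one used by Jiang et al. (2020). The paraphrased prompts and all probabilities are hypothetical values made up for illustration, not real model outputs, and `aggregate` is a hypothetical helper.

```python
from collections import defaultdict


def aggregate(per_prompt_probs):
    """Average each candidate's probability over all prompts.

    A candidate missing from a prompt's predictions contributes 0 for that prompt.
    """
    totals = defaultdict(float)
    for probs in per_prompt_probs.values():
        for candidate, p in probs.items():
            totals[candidate] += p
    n = len(per_prompt_probs)
    return {candidate: total / n for candidate, total in totals.items()}


# HYPOTHETICAL numbers for illustration only; these are not real model outputs.
per_prompt_probs = {
    "DirectX is developed by [MASK].": {"Intel": 0.40, "Microsoft": 0.35},
    "[MASK] released DirectX.": {"Microsoft": 0.55, "Intel": 0.20},
    "DirectX was created by [MASK].": {"Microsoft": 0.50, "Intel": 0.25},
}

averaged = aggregate(per_prompt_probs)
best = max(averaged, key=averaged.get)
print(best)  # Microsoft
```

In this toy setup a single prompt ranks a similar but mutually exclusive answer first, while averaging over the paraphrases lets the answer that is consistently supported win.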
  220. 70 LMs lack an understanding of basic physical properties of

    the world (Bisk et al. 2020) LMs lack perceptual knowledge (Forbes et al. 2019, Weir et al., 2020) Open Problems #2 - Limited Coverage
  221. Open Problems #3 - Reporting Bias

  222. % from text $ from people Acquiring Commonsense Knowledge 72

  223. % from text $ from people Impossible to manually enumerate

    Acquiring Commonsense Knowledge 72
  224. % from text $ from people Impossible to manually enumerate

    Reporting bias
 (Gordon and Van Durme, 2013) murdered + killed breathed + exhaled + inhaled Acquiring Commonsense Knowledge 72
  225. % from text $ from people Impossible to manually enumerate

    ' from large-scale neural language models Reporting bias
 (Gordon and Van Durme, 2013) murdered + killed breathed + exhaled + inhaled Acquiring Commonsense Knowledge 72
  226. 
 Do Neural Language Models Overcome Reporting Bias? Vered Shwartz

    and Yejin Choi. COLING 2020. Capture facts not explicitly mentioned in the corpus Non zero probability for trivial facts ⇒ Open Problems #3 - Reporting Bias
  227. 
 Do Neural Language Models Overcome Reporting Bias? Vered Shwartz

    and Yejin Choi. COLING 2020. Capture facts not explicitly mentioned in the corpus Non zero probability for trivial facts ⇒ Everyone is dead Overestimate very rare actions Open Problems #3 - Reporting Bias
  228. Overestimate very rare outcomes 
 Do Neural Language Models Overcome

    Reporting Bias? Vered Shwartz and Yejin Choi. COLING 2020. Capture facts not explicitly mentioned in the corpus Non zero probability for trivial facts ⇒ Everyone is dead Overestimate very rare actions Open Problems #3 - Reporting Bias
  229. Overestimate very rare outcomes The man turned on the faucet.

    As a result, 
 Do Neural Language Models Overcome Reporting Bias? Vered Shwartz and Yejin Choi. COLING 2020. Capture facts not explicitly mentioned in the corpus Non zero probability for trivial facts ⇒ Everyone is dead Overestimate very rare actions Open Problems #3 - Reporting Bias
  230. Overestimate very rare outcomes The man turned on the faucet.

    As a result, GPT-2 the man’s blood was sprayed everywhere. 
 Do Neural Language Models Overcome Reporting Bias? Vered Shwartz and Yejin Choi. COLING 2020. Capture facts not explicitly mentioned in the corpus Non zero probability for trivial facts ⇒ Everyone is dead Overestimate very rare actions Open Problems #3 - Reporting Bias
  231. Don’t differentiate generic facts from grounded knowledge about named entities

    "You are grounded!": Latent Name Artifacts in Pre-trained Language Models. Vered Shwartz, Rachel Rudinger, and Oyvind Tafjord. EMNLP 2020. Open Problems #3 - Reporting Bias
  232. Richard has a bad Don’t differentiate generic facts from grounded

    knowledge about named entities "You are grounded!": Latent Name Artifacts in Pre-trained Language Models. Vered Shwartz, Rachel Rudinger, and Oyvind Tafjord. EMNLP 2020. Open Problems #3 - Reporting Bias
  233. GPT-2 Richard has a bad habit of saying things that

    are not true. Don’t differentiate generic facts from grounded knowledge about named entities "You are grounded!": Latent Name Artifacts in Pre-trained Language Models. Vered Shwartz, Rachel Rudinger, and Oyvind Tafjord. EMNLP 2020. Open Problems #3 - Reporting Bias
  235. GPT-2 Richard has a bad Donald has a bad habit

    of saying things that are not true. Don’t differentiate generic facts from grounded knowledge about named entities "You are grounded!": Latent Name Artifacts in Pre-trained Language Models. Vered Shwartz, Rachel Rudinger, and Oyvind Tafjord. EMNLP 2020. Open Problems #3 - Reporting Bias
  236. GPT-2 Richard has a bad Donald has a bad habit

    of saying things that are not true. reputation for being a racist. Don’t differentiate generic facts from grounded knowledge about named entities "You are grounded!": Latent Name Artifacts in Pre-trained Language Models. Vered Shwartz, Rachel Rudinger, and Oyvind Tafjord. EMNLP 2020. Open Problems #3 - Reporting Bias
  238. Learning Commonsense Knowledge from Text Acquiring Commonsense Knowledge 75 The

    way forward: multimodality
  239. Learning Commonsense Knowledge from Text ( from Text, Images and

    Videos << Acquiring Commonsense Knowledge 75 The way forward: multimodality
  240. Learning Commonsense Knowledge from Text ( from Text, Images and

    Videos << front row: cross-legged ⇒ last row: standing ⇒ Acquiring Commonsense Knowledge 75 The way forward: multimodality
  241. Learning Commonsense Knowledge from Text ( from Text, Images and

    Videos << Hanging up the phone without saying goodbye Reporting Bias! Reporting Bias! Acquiring Commonsense Knowledge 76
  243. The girl spilt orange juice on herself and started crying.

    “Why are you crying?” her dad asked. “Because my clothes are wet”, replied the girl. “And why are they wet?” 77 Open Problems #4 - Language Generation != Reasoning
  244. The girl spilt orange juice on herself and started crying.

    “Why are you crying?” her dad asked. “Because my clothes are wet”, replied the girl. “And why are they wet?” “Because I fell in the swimming pool.” “And why did you fall in the swimming pool?”
 “Because I couldn’t see the water”, the girl replied.
 The moral of the story is: 
 Always wear a blindfold when you go swimming. 77 Open Problems #4 - Language Generation != Reasoning
  245. Recap

  246. Recap 🤖A framework for discovering implicit knowledge through asking clarification

    questions
  248. Recap 🤖A framework for discovering implicit knowledge through asking clarification

    questions 🤖New tasks and models for nonmonotonic reasoning in natural language

  249. Recap 🤖A framework for discovering implicit knowledge through asking clarification

    questions 🤖New tasks and models for nonmonotonic reasoning in natural language
 🤖Still a long way to go to reach human-level commonsense reasoning abilities: • Knowledge reliability • Reasoning abilities: deductive, causal, nonmonotonic • “Seeing” the world
  250. Recap 🤖A framework for discovering implicit knowledge through asking clarification

    questions 🤖New tasks and models for nonmonotonic reasoning in natural language
 🤖Still a long way to go to reach human-level commonsense reasoning abilities: • Knowledge reliability • Reasoning abilities: deductive, causal, nonmonotonic • “Seeing” the world vereds@allenai.org @VeredShwartz Thank You
  251. References (1)

    (1) Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unsupervised Commonsense Question Answering with Self-Talk. EMNLP 2020.
    (2) Lianhui (Karen) Qin, Vered Shwartz, Peter West, Chandra Bhagavatula, Jena Hwang, Ronan Le Bras, Antoine Bosselut, and Yejin Choi. Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning. EMNLP 2020.
    (3) Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, and Yejin Choi. Thinking Like a Skeptic: Defeasible Inference in Natural Language. Findings of EMNLP 2020.
    (4) Faeze Brahman, Vered Shwartz, Rachel Rudinger, and Yejin Choi. Learning to Rationalize for Nonmonotonic Reasoning with Distant Supervision. AAAI 2021.
    (5) Vered Shwartz and Yejin Choi. Do Neural Language Models Overcome Reporting Bias? COLING 2020.
    (6) Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right. arXiv 2021.
    (7) Maxwell Forbes, Jena D. Hwang, Vered Shwartz, Maarten Sap, and Yejin Choi. Social Chemistry 101: Learning to Reason about Social and Moral Norms. EMNLP 2020.
    (8) Maarten Sap, Vered Shwartz, Antoine Bosselut, Dan Roth, and Yejin Choi. Introductory Tutorial on Commonsense Reasoning. ACL 2020.
    (9) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis Only Baselines in Natural Language Inference. *SEM 2018.
    (10) Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the Behavior of Visual Question Answering Models. EMNLP 2016.
    (11) Allyson Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. TACL 2020.
    (12) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. NAACL 2019.
    (13) Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense transformers for automatic knowledge graph construction. ACL 2019.
    (14) Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. “Going on a vacation” takes longer than “Going for a walk”: A study of temporal commonsense understanding. EMNLP 2019.
    (15) Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. Abductive Commonsense Reasoning. ICLR 2020.
    (16) Christian Szegedy, et al. Going deeper with convolutions. CVPR 2015.
    (17) Fernando Martínez-Plumed, Ricardo B.C. Prudêncio, Adolfo Martínez-Usó, and José Hernández-Orallo. Item response theory in AI: Analysing machine learning classifiers at the instance level. Artificial Intelligence 2019.
    (18) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. Why We Need New Evaluation Metrics for NLG. EMNLP 2017.
  252. References (2)

    (19) Jerome S. Bruner. The act of discovery. Harvard Educational Review 1961.
    (20) Jonathan Gordon and Benjamin Van Durme. Reporting bias and knowledge acquisition. Workshop on Automated Knowledge Base Construction 2013.
    (21) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WINOGRANDE: An adversarial winograd schema challenge at scale. AAAI 2020.
    (22) Lianhui (Karen) Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. Counterfactual Story Reasoning and Generation. EMNLP 2019.
    (23) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. EMNLP 2019.
    (24) Nora Kassner and Hinrich Schütze. Negated LAMA: Birds cannot fly. ACL 2020.
    (25) Robyn Speer and Catherine Havasi. Representing general relational knowledge in ConceptNet 5. LREC 2012.
    (26) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. Annotation Artifacts in Natural Language Inference Data. NAACL 2018.
    (27) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. AAAI 2020.
    (28) Raymond Reiter. A Logic for Default Reasoning. Artificial Intelligence 1980.
    (29) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image Style Transfer Using Convolutional Neural Networks. CVPR 2016.
    (30) Charles Sanders Peirce. Collected papers of Charles Sanders Peirce, volume 5. Harvard University Press, 1965.
    (31) Nelson Goodman. The problem of counterfactual conditionals. The Journal of Philosophy 1947.
    (32) Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics. EMNLP 2020.
    (33) Maxwell Forbes, Ari Holtzman, and Yejin Choi. Do Neural Language Representations Learn Physical Commonsense? CogSci 2019.
    (34) Nathaniel Weir, Adam Poliak, and Benjamin Van Durme. Probing Neural Language Models for Human Tacit Assumptions. CogSci 2020.