Commonsense Knowledge and Reasoning in Natural Language

Vered Shwartz
WING NUS NLP Seminar, July 2021

The LM Pre-training Revolution

[Chart: benchmark scores over time, Jul '19 to Jan '21, on a 0-100 axis, with a human performance line. Labeled values: Baseline 71.5, 84.6, 89.3, 90.2, 90.3.]

Is language understanding nearly solved?

• Translation
• Reading Comprehension
• Chatbots

Is Natural Language Understanding Nearly Solved?

Pre-training:
✅ Syntax
✅ Word meanings
✅ Factual Knowledge
✅ …

Fine-tuning:
✅ Understanding the task
✅ Learning to solve the task
❓ Generalization to unknown situations

[Diagram: a language model reads "The chocolate cake is amazing" and predicts + (94.6%) over - (5.4%).]

What are the remaining challenges?

Overfitting to Data-specific Spurious Correlations

• Image captioning (Szegedy et al., 2015)
  🤖: A horse standing in the grass.

• Visual question answering (Agrawal et al., 2016)
  How many zebras? 🤖: 2
  Seen in the data: How many dogs? 2. How many zebras? 2. How many giraffes? 2.

• Natural language inference (Gururangan, Swayamdipta, et al., 2018; Poliak et al., 2018)
  p: I only had a soup but it was very filling.
  h: I didn't eat a salad.
  🤖: contradiction (91.7%)
  Seen in the data: p: The boy ran in the park. h: The boy didn't run in the park. Label: contradiction.

…Solving datasets but not underlying tasks!

What is Commonsense?

The basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people.

• It's a bad idea to touch a hot stove.
• It's impolite to comment on people's weight.
• Eating dinner comes before going to bed.
• …

Introductory Tutorial on Commonsense Reasoning. Maarten Sap, Vered Shwartz, Antoine Bosselut, Dan Roth, and Yejin Choi. ACL 2020.

Why Do NLP Models Need Commonsense?

• Translation: __ __ = yogurt with grass
• Reading Comprehension: Stevie Wonder announces he'll be having kidney surgery during London concert
• Chatbots: Medical chatbot using OpenAI's GPT-3 told a fake patient to kill themselves

Outline

• Introspective knowledge acquisition through asking questions
• Nonmonotonic reasoning in natural language
• Open problems and future directions

Reasoning with Implicit Knowledge

Children need to eat more vegetables because they are healthy.
Answer choices: children, vegetables

Implicit knowledge:
• Vegetables are healthy.
• Eating vegetables can make you healthier.
• People want to be healthy.

Unsupervised Commonsense Question Answering with Self-Talk. Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula and Yejin Choi. EMNLP 2020.

Discovery Learning (Bruner, 1961)

Children need to eat more vegetables because they are healthy.
Answer choices: children, vegetables

• Learner
• Self-Inquiry: What are the properties of vegetables?
• Existing Knowledge: Vegetables are full of vitamins.
• New Facts

The Self-Talk Paradigm

Main Question: Children need to eat more vegetables because they are healthy.
Answer choices: children, vegetables

• Neural Language Model
• Nested QA
  • Self-Inquiry: What are the properties of vegetables?
  • Existing Knowledge: Vegetables are full of vitamins.
• Main Answer

Input
Context: Children need to eat more vegetables because they are healthy.
Answer choices: children, vegetables

Pipeline: Knowledge Discovery, then Question Answering

Output
Predicted answer choice: vegetables

Knowledge Discovery

Instance: Children need to eat more vegetables because they are healthy.

1. Nested Question Prefix: "What is the purpose of"
2. Given the instance and the question prefix, the language model completes the question: "vegetables"
3. Nested Answer Prefix: "The purpose of ___ is", filled in as "The purpose of vegetables is"
4. Given the instance and the answer prefix, the language model completes the answer: "to provide a good base of nutrients and energy"

Nested Answers:
• The purpose of vegetables is to provide a good base of nutrients and energy.
• The properties of being healthy are linked to the effects of exercise.
• The definition of healthy is quality of life that is free of diseases.
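The prefix-completion loop of knowledge discovery can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `PREFIXES`, `discover`, and `stub_lm` are hypothetical names, and `stub_lm` returns canned completions taken from the slide instead of calling a real language model.

```python
from typing import Callable, List

# Hypothetical mapping: question prefix -> answer prefix template.
# "{}" is filled with the ending the LM generates for the question.
PREFIXES = {
    "What is the purpose of": "The purpose of {} is",
    "What is the definition of": "The definition of {} is",
}

def discover(instance: str, complete: Callable[[str], str]) -> List[str]:
    """Generate clarifications by completing nested questions, then nested answers."""
    clarifications = []
    for q_prefix, a_template in PREFIXES.items():
        # Step 1: the LM finishes the question, conditioned on the instance.
        q_end = complete(f"{instance} {q_prefix}").strip().rstrip("?")
        question = f"{q_prefix} {q_end}?"
        # Step 2: the LM finishes the answer, conditioned on instance and question.
        a_prefix = a_template.format(q_end)
        a_end = complete(f"{instance} {question} {a_prefix}").strip()
        clarifications.append(f"{a_prefix} {a_end}")
    return clarifications

# Stand-in for a language model: canned completions keyed by prompt ending.
CANNED = {
    "What is the purpose of": "vegetables?",
    "The purpose of vegetables is": "to provide a good base of nutrients and energy.",
    "What is the definition of": "healthy?",
    "The definition of healthy is": "quality of life that is free of diseases.",
}

def stub_lm(prompt: str) -> str:
    for ending, completion in CANNED.items():
        if prompt.endswith(ending):
            return completion
    return ""

INSTANCE = "Children need to eat more vegetables because they are healthy."
for clarification in discover(INSTANCE, stub_lm):
    print(clarification)
```

With a real model, `complete` would sample several completions per prefix, so each prefix could yield many clarifications instead of one.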

Knowledge Discovery

Input
Context: Children need to eat more vegetables because they are healthy.
Answer choices: children, vegetables

Question prefixes given to the LM:
• What are the properties of a
• What is the purpose of
• What is the main function of
• …

Example generated question: What is the definition of healthy?

Generated clarifications:
• The purpose of vegetables is to provide a good base of nutrients and energy.
• The properties of being healthy are linked to the effects of exercise.
• The definition of healthy is quality of life that is free of diseases.

Question Answering

Each answer choice is substituted into the instance, and the resulting statement is paired with each clarification.

Children:
• Children need to eat more vegetables because children are healthy. The purpose of vegetables is to provide a good base of nutrients and energy.
• …
• Children need to eat more vegetables because children are healthy. The definition of healthy is quality of life that is free of diseases.

Vegetables:
• Children need to eat more vegetables because vegetables are healthy. The purpose of vegetables is to provide a good base of nutrients and energy.
• …
• Children need to eat more vegetables because vegetables are healthy. The definition of healthy is quality of life that is free of diseases.

Most plausible statement: the statement with the best language model score

Most plausible statement: language model score

The language model assigns each token of a statement a probability given the preceding tokens:

p(need | Children), p(to | Children need), …, p(energy | …), p(<eos> | …)

$$\text{score} = -\frac{1}{n} \log\Big( p(\text{need} \mid \text{Children}) \cdot p(\text{to} \mid \text{Children need}) \cdots p(\text{energy} \mid \ldots) \cdot p(\text{<eos>} \mid \ldots) \Big)$$

where n is the number of tokens.

We'll get back to this soon…
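The length-normalized score can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the function name `normalized_nll` is hypothetical, and the per-token probabilities are invented for illustration, not real model output.

```python
import math

def normalized_nll(token_probs):
    """Average negative log-probability of a statement's tokens.

    Equals -(1/n) * log(product of token probabilities); lower means more plausible.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Invented per-token probabilities for two statements of equal length.
plausible = [0.30, 0.45, 0.60, 0.50]
implausible = [0.30, 0.45, 0.05, 0.50]  # one very unlikely token

print(normalized_nll(plausible))
print(normalized_nll(implausible))
```

Summing logs rather than multiplying probabilities avoids numerical underflow on long statements, and dividing by n keeps longer statements from being penalized merely for having more tokens.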


Input
Context: Children need to eat more vegetables because they are healthy.
Answer choices: children, vegetables

Knowledge Discovery (LM)
• Question prefixes: "What are the properties of a", "What is the purpose of", "What is the main function of", …
• Example generated question: What is the definition of healthy?
• Clarifications:
  • The purpose of vegetables is to provide a good base of nutrients and energy.
  • The properties of being healthy are linked to the effects of exercise.
  • The definition of healthy is quality of life that is free of diseases.

Question Answering (LM)
• Answer with most plausible statement

Output
Predicted answer choice: vegetables
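The "answer with most plausible statement" step can be sketched as a search over (answer choice, clarification) pairs. This is an illustrative sketch only: `answer` and `toy_score` are hypothetical names, and `toy_score` is a hand-written stand-in for a language model score (lower is better), not a real model.

```python
from typing import Callable, List, Tuple

def answer(template: str, choices: List[str], clarifications: List[str],
           score: Callable[[str], float]) -> Tuple[str, str]:
    """Return the (choice, clarification) pair whose combined text scores lowest."""
    best = None
    for choice in choices:
        statement = template.format(choice)  # substitute the choice for the pronoun
        for clarification in clarifications:
            s = score(f"{statement} {clarification}")
            if best is None or s < best[0]:
                best = (s, choice, clarification)
    return best[1], best[2]

def toy_score(text: str) -> float:
    # Stand-in scorer with invented values: pretends the "children are healthy"
    # reading is less plausible than the "vegetables are healthy" reading.
    base = 2.0 if "because children" in text else 1.0
    return base + (0.1 if "definition" in text else 0.0)

TEMPLATE = "Children need to eat more vegetables because {} are healthy."
CLARIFICATIONS = [
    "The purpose of vegetables is to provide a good base of nutrients and energy.",
    "The definition of healthy is quality of life that is free of diseases.",
]

choice, used = answer(TEMPLATE, ["children", "vegetables"], CLARIFICATIONS, toy_score)
print(choice)
print(used)
```

Taking the minimum over all pairs means a single helpful clarification is enough to support an answer choice; unhelpful clarifications are simply ignored.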

Commonsense Question Answering Tasks

• Causal: COPA
• General: CommonsenseQA, WinoGrande
• Physical: PIQA
• Social: SocialIQa
• Temporal: MC-TACO

Choice of Plausible Alternatives (COPA):
The man broke his toe. What was the cause?
1) He got a hole in his sock.
2) He dropped a hammer on his foot.

Social Interaction QA (SocialIQa):
Although Aubrey was older and stronger, they lost to Alex in arm wrestling. How would Alex feel as a result?
1) they need to practice more.
2) ashamed.
3) boastful.

Baselines

Instance: Children need to eat more vegetables because they are healthy.

• No Inquiry: ∅
• Expert Knowledge, queried with (vegetables, healthy):
  • Relation path: vegetables, required for, eating vegetables, motivated by, healthy. Verbalized: "Vegetables are required for eating vegetables. Eating vegetables is motivated by being healthy."
  • "Vegetables are healthy."
  • cause: "Because the children wanted to live longer."

Results

[Bar chart: No Inquiry, Expert Knowledge, Self-Talk, and Human, on a 0-100 axis.]

1. Nested QA improves performance
2. Self-Talk performs similarly to models with expert knowledge
3. Gap from human performance

What should I ask about?

Measuring Plausibility

Surface Form Competition: Why the Highest Probability Answer Isn't Always Right. Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. arXiv 2021.

Measuring Plausibility: Zero-shot Models

Predict: $\arg\max_i P(\text{ans}_i \mid \text{question})$

Standard language model probability, for a statement "A human … whirlpool bath":

p(human | A), p(wants | A human), …, p(bath | …), p(<eos> | …)

Confounders:
(1) String length
(2) Word frequency

Normalize by length:

$$\text{score} = -\frac{1}{n} \log\Big( p(\text{human} \mid \text{A}) \cdot p(\text{wants} \mid \text{A human}) \cdots p(\text{bath} \mid \ldots) \cdot p(\text{<eos>} \mid \ldots) \Big)$$

Measuring Plausibility: Surface Form Competition

Different strings represent the same concept but compete for probability!

* Not considering multiple possible correct answers

Measuring Plausibility: Surface Form Competition

Probability:

$$\arg\max_i P(\text{ans}_i \mid \text{question})$$

Domain Conditional PMI:

$$\arg\max_i \frac{P(\text{ans}_i \mid \text{question})}{P(\text{ans}_i \mid \text{domain})}$$

• domain: e.g. "The answer is"
• $P(\text{ans}_i \mid \text{domain})$: prior probability of each answer

Domain Conditional PMI consistently outperforms other zero-shot scoring methods across multiple-choice tasks (QA, entailment, text classification), for GPT-2 & GPT-3!
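The difference between the two scoring rules can be seen in a small sketch. The function names `pick_by_probability` and `pick_by_dcpmi` are hypothetical, and all probabilities below are invented for illustration; in practice both conditionals would come from the same language model.

```python
import math

def pick_by_probability(logp_question):
    """argmax_i log P(ans_i | question)."""
    return max(logp_question, key=logp_question.get)

def pick_by_dcpmi(logp_question, logp_domain):
    """argmax_i [log P(ans_i | question) - log P(ans_i | domain)]."""
    return max(logp_question, key=lambda a: logp_question[a] - logp_domain[a])

# Invented probabilities. "puddle" is a frequent string, so it has a high
# prior under the domain premise as well as under the question.
p_question = {"whirlpool bath": 0.02, "puddle": 0.05, "coffee cup": 0.01}
p_domain = {"whirlpool bath": 0.001, "puddle": 0.02, "coffee cup": 0.01}

logp_question = {a: math.log(p) for a, p in p_question.items()}
logp_domain = {a: math.log(p) for a, p in p_domain.items()}

print(pick_by_probability(logp_question))
print(pick_by_dcpmi(logp_question, logp_domain))
```

Dividing by the domain-conditional prior (subtracting in log space) rewards answers whose probability rises because of the question, rather than answers that are likely strings to begin with.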

Nonmonotonic Reasoning

• Abductive reasoning (most plausible explanation): ART (Bhagavatula et al., 2020)
• Counterfactual reasoning (what if?): TimeTravel (Qin et al., 2019)
• Defeasible reasoning (updating inferences with additional …): δ-NLI (Rudinger et al., 2020)

Abductive Reasoning (Peirce, 1965)

Reason about the most plausible explanation for incomplete observations.

Example from ART (Bhagavatula et al., 2020):
• Observation: Sara wanted to make dinner for some guests.
• Explanation: But she didn't know how to cook.
• Observation: She had to order pizza for her friends instead.

Useful for filling in gaps in story understanding

Challenge: Language models are conditioned only on a past context

Sara wanted to make dinner for some guests.
GPT-2: "I'm going to go grab some rice noodles," she says.

Solution: compute loss w.r.t. future constraints & backpropagate to the output

Inspiration: Image Style Transfer (Gatys et al., 2016)
[Diagram: a ConvNet with a source image and a style as inputs; the loss is backpropagated to the output.]

DELOREAN: DEcoding for nonmonotonic LOgical REAsoNing

Input:
• X - past context: Sara wanted to make dinner for some guests.
• Z - future constraints: She had to order pizza for her friends instead.

Output:
• Y - continuation: a fluent continuation of X that satisfies the constraints Z

Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning. Lianhui (Karen) Qin, Vered Shwartz, Peter West, Chandra Bhagavatula, Jena Hwang, Ronan Le Bras, Antoine Bosselut, and Yejin Choi. EMNLP 2020.

DELOREAN steps: Initialization, Backward Pass, Forward Pass

Initialization

[Figure: the LM reads the past context X = x1 … xNX ("Sara wanted to make dinner for some guests.") and produces the initial logits Ỹ = ỹ1, ỹ2, …, ỹN. Z: "She had to order pizza for her friends instead."]

Ỹ = decode N tokens from LM_forward(X)

Backward Pass

[Figure: the LM reads X and the current logits ỹ1 … ỹN, followed by the tokens z1 … zNZ of Z ("She had to order pizza for her friends instead."). A task-specific loss function is backpropagated to give the backward logits ỹ^b_1 … ỹ^b_N.]

Task-specific loss function: maximize the likelihood of the LM to generate the future observation Z following the past observation X and the generated hypothesis Ỹ.

$$\mathcal{L}(X, \tilde{Y}, Z) := -\sum_{n=1}^{N_Z} \log P_{LM}(z_n \mid X, \tilde{Y}, Z_{1:n-1})$$
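As a worked illustration of the abductive loss, the sum of negative log-probabilities can be computed directly once the per-token probabilities of Z are known. This is only a sketch: the function name `future_constraint_loss` is hypothetical, and the probabilities stand in for values a real LM would assign to each z_n given X, Ỹ and the preceding tokens of Z.

```python
import math

def future_constraint_loss(z_token_probs):
    """Negative log-likelihood of the future observation Z.

    z_token_probs[n] is assumed to be P_LM(z_n | X, Y~, Z_{1:n-1}).
    """
    return -sum(math.log(p) for p in z_token_probs)

# Hypothetical per-token probabilities of Z under two candidate hypotheses.
loss_weak = future_constraint_loss([0.5, 0.5])     # hypothesis explains Z poorly
loss_strong = future_constraint_loss([0.9, 0.9])   # hypothesis explains Z well
print(loss_weak, loss_strong)
```

A hypothesis that makes the future observation more likely yields a smaller loss, which is the signal the backward pass pushes into the logits of Ỹ.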

Forward Pass

[Figure: the LM reads X and produces forward logits ỹ^f_1 … ỹ^f_N, which are combined with the backward logits ỹ^b_1 … ỹ^b_N to update ỹ1 … ỹN.]

Ỹ = decode N tokens from LM_forward(X), and mix with the backward logits.
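The mixing of forward and backward logits can be sketched as below. The slides only say the two are mixed; the convex combination with a weight `gamma`, the three-word vocabulary, and the logit values are all assumptions made for this illustration.

```python
def mix_logits(forward, backward, gamma=0.5):
    """Mix per-position logits: gamma * forward + (1 - gamma) * backward.

    The convex combination and the name `gamma` are assumptions of this sketch.
    """
    return [
        [gamma * f + (1.0 - gamma) * b for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(forward, backward)
    ]

def argmax_tokens(logits, vocab):
    """Pick the highest-scoring vocabulary item at every position."""
    return [vocab[max(range(len(row)), key=lambda i: row[i])] for row in logits]

vocab = ["noodles", "cook", "pizza"]   # toy vocabulary
forward = [[2.0, 1.0, 0.0]]            # fluent continuation of X alone
backward = [[0.0, 3.0, 1.0]]           # logits after backprop from Z

forward_only = argmax_tokens(mix_logits(forward, backward, gamma=1.0), vocab)
mixed = argmax_tokens(mix_logits(forward, backward, gamma=0.5), vocab)
print(forward_only, mixed)
```

In this toy case the forward logits alone prefer a token that ignores the future constraint, while the mixed logits prefer the token favoured by the backward pass.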

Generation

Greedy decoding from Ỹ gives Y, e.g. "But she didn’t know how to cook."

DELOREAN, full procedure:
Initialization ⇒ Backward Pass ⇒ Forward Pass ⇒ Greedy Decoding (repeat T times) ⇒ Select Best Y

Select Best Y

Select the Y(t) that is most likely to follow and precede its adjacent sentences:

$$\mathrm{score}(Y^{(t)}) = \mathrm{BERT}_{NSP}(X Y^{(t)}, Z) + \mathrm{BERT}_{NSP}(X, Y^{(t)} Z)$$

• BERT_NSP(XY(t), Z): P(She had to order pizza for her friends instead. | Sara wanted to make dinner for some guests. But she didn’t know how to cook.)
• BERT_NSP(X, Y(t)Z): P(But she didn’t know how to cook. She had to order pizza for her friends instead. | Sara wanted to make dinner for some guests.)
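The ranking step can be sketched as follows. The scorer is passed in as a function so that a real next-sentence-prediction model could be plugged in; `toy_nsp` here is a deliberately crude stand-in (not BERT) that exists only so the example runs, and `select_best` is a hypothetical helper name.

```python
def select_best(x, z, candidates, nsp):
    """Return the candidate Y maximizing nsp(X + Y, Z) + nsp(X, Y + Z)."""
    def score(y):
        return nsp(f"{x} {y}", z) + nsp(x, f"{y} {z}")
    return max(candidates, key=score)

def toy_nsp(first, second):
    """Stand-in scorer for illustration only; a real system would use an NSP model."""
    return 0.9 if "cook" in (first + " " + second) else 0.2

x = "Sara wanted to make dinner for some guests."
z = "She had to order pizza for her friends instead."
candidates = [
    "She was thinking about the best way.",
    "But she didn't know how to cook.",
]
best = select_best(x, z, candidates, toy_nsp)
print(best)
```

The point of the two-term score is that Y is judged both as a continuation of X and as a lead-in to Z, so a candidate that fits only one side is ranked lower.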

Human Evaluation Results: Abductive Reasoning

[Figure: bar chart of coherence (scale 0 to 10) for X-Y, Y-Z and X-Y-Z, comparing DELOREAN, Unsupervised, Supervised and Human. Bar values as extracted: 7.83, 8.23, 8.79; 3.14, 3.75, 5.1; 2.36, 2.38, 4.74; 2.97, 3.25, 5.22.]

1. Outperforms unsupervised models substantially
2. Competitive with supervised models!
3. Large gap from human performance

Example Generations

X: Sara wanted to make dinner for some guests.
Z: She had to order pizza for her friends instead.

1. She was thinking about the best way.
2. However, her cooking skills were the only thing that could make it a success.
3. But she couldn’t, because she was too busy with her work.
4. But she didn’t have the money and she didn’t have her own kitchen.
5. But she didn’t know how to cook.

Backward pass introduces: contrast!

Counterfactual Reasoning (Goodman, 1947)

Reason about changes in outcomes given a change in conditions.

• Useful for Argument Mining: if X had happened, it would have resulted in some unwanted outcome Y.
• Useful for Detecting Misinformation: claim X is false because it entails claim Y, which is known to be false.

TimeTravel (Qin et al., 2019)

Original story: Lisa was throwing a Halloween party. All her friends were dressing up. Lisa thought about being a wizard. Then she decided on a scarier costume. Lisa dressed up like a vampire.

Counterfactual beginning: Lisa was throwing a Halloween party. All her friends were dressing up. It was a Game of Thrones themed party.

Alternative ending:
• Adheres to the counterfactual beginning
• Minimally edits the original ending

Edited ending: Lisa thought about how she would dress up as a Lannister, but she didn’t want to look like a Lannister. She wanted to look like a Stark. Lisa dressed up like a Stark.

Counterfactual Reasoning with DELOREAN

Input:
• X - counterfactual beginning: Lisa was throwing a Halloween party. All her friends were dressing up. It was a Game of Thrones themed party.
• Z - original ending: Lisa thought about being a wizard. Then she decided on a scarier costume. Lisa dressed up like a vampire.

Output:
• Y - alternative ending, which adheres to the counterfactual story beginning and minimally edits the original ending

Steps: Initialization ⇒ Backward Pass ⇒ Forward Pass + Generation ⇒ Select best Y

Backward pass: minimize the KL divergence between the original ending Z (one-hot representation) and the generated ending Ỹ.

$$\mathcal{L}(X, \tilde{Y}, Z) := \mathrm{KL}\left(Z \,\|\, \mathrm{softmax}(\tilde{Y}/\tau)\right)$$

DELOREAN was the only method to achieve a good balance between the two requirements.
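Because Z is one-hot, the KL term at each position reduces to the negative log of the softmax probability assigned to the original token. The sketch below works this through on toy logits; summing over positions, the helper names, and the numbers are assumptions made for illustration.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_one_hot(target_ids, logits, tau=1.0):
    """KL(Z || softmax(Y~ / tau)) for one-hot Z, summed over positions.

    With a one-hot target, each position contributes -log p(target token).
    """
    return sum(-math.log(softmax(row, tau)[t]) for t, row in zip(target_ids, logits))

uniform = kl_to_one_hot([0], [[0.0, 0.0]])       # no preference for the original token
close = kl_to_one_hot([0], [[3.0, 0.0]])         # logits already favour the original token
print(uniform, close)
```

Logits that already favour the tokens of the original ending give a smaller loss, which is what keeps the generated ending a minimal edit of Z.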

Defeasible Inference (Reiter, 1980)

Given premise P, a hypothesis H is defeasible if there exists an update U (consistent with P) such that a human would find H less likely to be true after learning U.

P: Tweety is a bird.
H: Tweety flies.
U: Tweety is a penguin.

Useful for Real-time Summarization: facts change as the story unfolds.

Defeasible Inference in Natural Language

An update U is called a weakener if, given a premise P and hypothesis H, a human would most likely find H less likely to be true after learning U; if they would find H more likely to be true, then we call U a strengthener.

P: Tweety is a bird.
H: Tweety flies.
Weakener: Tweety is a penguin.
Strengthener: Tweety is on a tree.

Thinking Like a Skeptic: Defeasible Inference in Natural Language. Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, and Yejin Choi. Findings of EMNLP 2020.
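The weakener / strengthener definition can be written as a small decision rule. This is a sketch under assumptions: the likelihood ratings are hypothetical numbers standing in for human judgements of H before and after learning U, and returning `None` when the rating does not change is a choice made here, not something the definition specifies.

```python
def update_type(likelihood_before, likelihood_after):
    """Label an update U by how it changes the judged likelihood of H."""
    if likelihood_after < likelihood_before:
        return "weakener"
    if likelihood_after > likelihood_before:
        return "strengthener"
    return None  # unchanged: neither label applies (assumption of this sketch)

# P: Tweety is a bird.  H: Tweety flies.  Ratings below are hypothetical.
penguin = update_type(0.8, 0.1)    # U: Tweety is a penguin.
on_tree = update_type(0.8, 0.95)   # U: Tweety is on a tree.
print(penguin, on_tree)
```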

Defeasible Inference in Natural Language

Premise: A group of people sitting around a rectangular table having either pieces of paper or laptops in front of them.
Hypothesis: They have a work meeting.
• [+] Strengthener: They are in a conference room.
• [-] Weakener: They are in a library.

Discriminative Task: determine whether an update weakens or strengthens the hypothesis.

Generative Task: generate a weakening or strengthening update for a given premise-hypothesis pair.

Language models leave plenty of room for improvement on the generative task!

Rationale Generation for Defeasible Inference

Premise: A group of people sitting around a rectangular table having either pieces of paper or laptops in front of them.
Hypothesis: They have a work meeting.
• [+] They are in a conference room. Rationale: A conference room is where people have meetings at work.
• [-] They are in a library. Rationale: You must be quiet in the library, while work meetings involve talking.

Distant supervision: e-SNLI; LM ("The definition of a library is…")

Learning to Rationalize for Nonmonotonic Reasoning with Distant Supervision. Faeze Brahman, Vered Shwartz, Rachel Rudinger, and Yejin Choi. AAAI 2021.

Rationale Generation for Defeasible Inference

Post hoc Rationalization: generates a rationale for a given decision (label).
• Given the update "They are in a library." and the label [-], generate "You must be quiet in the library, while work meetings involve talking."
• Trivially rephrasing the label! ("[+] implies that [H]")

Joint Prediction & Rationalization: predict the label (strengthener / weakener) and rationalize it.
• Given the update "They are in a library.", predict [-] and generate "You must be quiet in the library, while work meetings involve talking."
• More realistic but very challenging task!

Reliable Evaluation

Discriminative tasks (A / B / C):
• Easy to evaluate
• Models are right for the wrong reasons

Generative tasks:
• More nuanced & flexible than pre-defined labels
• More similar to human reasoning process (no “answer choices”)
• Infinite answer space (no “guessing” of correct answer)
• No reliable automatic evaluation metric

Reliable Evaluation: Generative Evaluation

Sara wanted to make dinner for some guests. She had to order pizza for her friends instead.
Reference: But she didn’t know how to cook.

Desiderata:
• Reward correct answers that are different from the reference, e.g. "Right before the guests arrived she tasted the food and it tasted bad."
• Penalize incorrect answers that are similar to the reference, e.g. "She didn’t know how to cook meat."

Reliable Evaluation: Generative Evaluation

Lexical Overlap Metrics: BLEU, ROUGE, METEOR, CIDEr
• Affected by lexical variation
• Weak correlation with human judgement (Novikova et al., 2017)

Semantic Similarity Based Metrics: SPICE, BERTScore, BLEURT
• Relatedness / similarity is very fuzzy!

🔮 Combine metrics
🔮 Extrinsic evaluation
🔮 Task-specific learned metric (Chen et al., 2020)
🔮 Train discriminator to evaluate the generator (e.g. Martínez-Plumed et al., 2019; Forbes et al., 2020)

Social Chemistry 101: Learning to Reason about Social and Moral Norms. Maxwell Forbes, Jena D. Hwang, Vered Shwartz, et al. EMNLP 2020.
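The problem with lexical overlap can be seen on the Sara example itself. The sketch below uses a plain unigram F1 as a simple stand-in for overlap metrics (it is not BLEU or ROUGE), with hypothetical helper names `tokens` and `unigram_f1`.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference sentence."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "But she didn't know how to cook."
correct_but_different = "Right before the guests arrived she tasted the food and it tasted bad."
incorrect_but_similar = "She didn't know how to cook meat."

f1_correct = unigram_f1(correct_but_different, reference)
f1_incorrect = unigram_f1(incorrect_but_similar, reference)
print(f1_correct, f1_incorrect)
```

The overlap score ranks the incorrect-but-similar answer far above the correct-but-different one, which is the opposite of both desiderata.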

Open Problems #1 - Limited Precision

• Not sensitive to negation (Kassner et al., 2020; Ettinger, 2020)
• Often predict similar but mutually-exclusive facts (Jiang et al., 2020), e.g. "DirectX is developed by [MASK]." Solution: paraphrase & aggregate
• Don’t differentiate constant vs. contingent facts: "Zebras are black and white" vs. "My shirt is blue / red"

Open Problems #2 - Limited Coverage

• LMs lack an understanding of basic physical properties of the world (Bisk et al., 2020)
• LMs lack perceptual knowledge (Forbes et al., 2019; Weir et al., 2020)

Open Problems #3 - Reporting Bias

Acquiring Commonsense Knowledge
• From people: impossible to manually enumerate
• From text: reporting bias (Gordon and Van Durme, 2013)
• From large-scale neural language models
[Figure: "murdered" + "killed" vs. "breathed" + "exhaled" + "inhaled".]

Do neural language models overcome reporting bias?
• Capture facts not explicitly mentioned in the corpus
• Non-zero probability for trivial facts
• Overestimate very rare actions ⇒ "Everyone is dead"
• Overestimate very rare outcomes, e.g. "The man turned on the faucet. As a result," ⇒ GPT-2: "the man’s blood was sprayed everywhere."

Do Neural Language Models Overcome Reporting Bias? Vered Shwartz and Yejin Choi. COLING 2020.

Open Problems #3 - Reporting Bias

Don’t differentiate generic facts from grounded knowledge about named entities

GPT-2 prompts: "Richard has a bad", "Donald has a bad"
GPT-2 continuations: "habit of saying things that are not true.", "reputation for being a racist."

"You are grounded!": Latent Name Artifacts in Pre-trained Language Models. Vered Shwartz, Rachel Rudinger, and Oyvind Tafjord. EMNLP 2020.

Acquiring Commonsense Knowledge

The way forward: multimodality

Learning commonsense knowledge from text << learning from text, images and videos

[Figure: image examples. Front row ⇒ cross-legged; last row ⇒ standing.]

Hanging up the phone without saying goodbye
Reporting Bias!

Open Problems #4 - Language Generation != Reasoning
The girl spilt orange juice on herself and started crying. “Why are you crying?” her dad asked. “Because my clothes are wet”, replied the girl. “And why are they wet?”
“Because I fell in the swimming pool.” “And why did you fall in the swimming pool?” “Because I couldn’t see the water”, the girl replied.
The moral of the story is: Always wear a blindfold when you go swimming.

Recap
🤖 A framework for discovering implicit knowledge through asking clarification questions
🤖 New tasks and models for nonmonotonic reasoning in natural language
🤖 Still a long way to go for human-level commonsense reasoning abilities:
• Knowledge reliability
• Reasoning abilities: deductive, causal, nonmonotonic
• “Seeing” the world
Thank You
@VeredShwartz

References (1)
(1) Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unsupervised Commonsense Question Answering with Self-Talk. EMNLP 2020.
(2) Lianhui (Karen) Qin, Vered Shwartz, Peter West, Chandra Bhagavatula, Jena Hwang, Ronan Le Bras, Antoine Bosselut, and Yejin Choi. Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning. EMNLP 2020.
(3) Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, and Yejin Choi. Thinking Like a Skeptic: Defeasible Inference in Natural Language. Findings of EMNLP 2020.
(4) Faeze Brahman, Vered Shwartz, Rachel Rudinger, and Yejin Choi. Learning to Rationalize for Nonmonotonic Reasoning with Distant Supervision. AAAI 2021.
(5) Vered Shwartz and Yejin Choi. Do Neural Language Models Overcome Reporting Bias? COLING 2020.
(6) Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right. arXiv 2021.
(7) Maxwell Forbes, Jena D. Hwang, Vered Shwartz, Maarten Sap, and Yejin Choi. Social Chemistry 101: Learning to Reason about Social and Moral Norms. EMNLP 2020.
(8) Maarten Sap, Vered Shwartz, Antoine Bosselut, Dan Roth, and Yejin Choi. Introductory Tutorial on Commonsense Reasoning. ACL 2020.
(9) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis Only Baselines in Natural Language Inference. *SEM 2018.
(10) Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the Behavior of Visual Question Answering Models. EMNLP 2016.
(11) Allyson Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. TACL 2020.
(12) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. NAACL 2019.
(13) Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense transformers for automatic knowledge graph construction. ACL 2019.
(14) Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. Going on a vacation takes longer than going for a walk: A study of temporal commonsense understanding. EMNLP 2019.
(15) Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. Abductive Commonsense Reasoning. ICLR 2020.
(16) Christian Szegedy, et al. Going deeper with convolutions. CVPR 2015.
(17) Fernando Martínez-Plumed, Ricardo B.C. Prudêncio, Adolfo Martínez-Usó, and José Hernández-Orallo. Item response theory in AI: Analysing machine learning classifiers at the instance level. Artificial Intelligence 2019.
(18) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. Why We Need New Evaluation Metrics for NLG. EMNLP 2017.

References (2)
(19) Jerome S. Bruner. The act of discovery. Harvard Educational Review, 1961.
(20) Jonathan Gordon and Benjamin Van Durme. Reporting bias and knowledge acquisition. Workshop on Automated Knowledge Base Construction 2013.
(21) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WINOGRANDE: An adversarial winograd schema challenge at scale. AAAI 2020.
(22) Lianhui (Karen) Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. Counterfactual Story Reasoning and Generation. EMNLP 2019.
(23) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. EMNLP 2019.
(24) Nora Kassner and Hinrich Schütze. Negated LAMA: Birds cannot fly. ACL 2020.
(25) Robyn Speer and Catherine Havasi. Representing general relational knowledge in ConceptNet 5. LREC 2012.
(26) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. Annotation Artifacts in Natural Language Inference Data. NAACL 2018.
(27) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. AAAI 2020.
(28) Raymond Reiter. A Logic for Default Reasoning. Artificial Intelligence, 1980.
(29) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image Style Transfer Using Convolutional Neural Networks. CVPR 2016.
(30) Charles Sanders Peirce. Collected papers of Charles Sanders Peirce, volume 5. Harvard University Press, 1965.
(31) Nelson Goodman. The problem of counterfactual conditionals. The Journal of Philosophy, 1947.
(32) Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics. EMNLP 2020.
(33) Maxwell Forbes, Ari Holtzman, and Yejin Choi. Do Neural Language Representations Learn Physical Commonsense? CogSci 2019.
(34) Nathaniel Weir, Adam Poliak, and Benjamin Van Durme. Probing Neural Language Models for Human Tacit Assumptions. CogSci 2020.
